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Abstract 

We consider a general incompressible finite model protein of size M in its environment, which we represent by a semiflexible 
copolymer consisting of amino acid residues classified into only two species (H and P, see text) following Lau and Dill. We 
allowing various interactions between chemically unbonded residues in a given sequence x an d the solvent (water), and exactly 
enumerate the number of conformations W(E) as a function of the energy E on an infinite lattice under two different conditions: 
(i) we allow conformations that are restricted to be compact (known as Hamilton walk conformations), and (ii) we allow 
unrestricted conformations that can also be non-compact. It is easily demonstrated using plausible arguments that our model 
does not possess any energy gap even though it is supposed to exhibit a sharp folding transition in the thermodynamic limit. 
The enumeration allows us to investigate exactly the effects of energetics on the native state(s), and the effect of small size on 
protein thermodynamics and, in particular, on the differences between the microcanonical and canonical ensembles. We find 
that the canonical entropy is much larger than the microcanonical entropy for finite systems. We investigate the property of 
self-averaging and conclude that small proteins do not self-average. We also present results that (i) provide some understanding 
of the energy landscape, and (ii) shed light on the free energy landscape at different temperatures. 
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I. INTRODUCTION 

A. Proteins as Semiflexible Heteropolymers 

Proteins are organic compounds made of amino acids, 
also known as residues, bound in a chain-like structure 
by peptide bonds. Self-assembling small proteins can fold 
into their native states (of minimum free energy) without 
any chaperones, and have been extensively investigated 
recently using lattice models by thermodynamic princi- 
ples [if. They differ from flexible polymers, which col- 
lapse to a compact disordered state; they are similar to 
semiflexible polymers in which semifiexibility forces an 
ordered (crystalline) compact structure at low tempera- 
tures 0. 

Let Nji denote the total number of residues in N pro- 
teins in a volume V; the residue concentration is 

c = N R /V. 

To ensure that the boundary of the volume V does not 
affect the behavior of the system, we need to take the 
limit V — ► co. This limit will be usually implicit in the 
following, unless mentioned otherwise. In many cases, we 
deal with a dilute solution so that the concentration of 
proteins is exceedingly small. Accordingly, the proteins 
are far apart with no appreciable inter-protein interac- 
tions. It is then safe to consider a single protein by itself 
in its environment, i.e. in the presence of water. The 
presence of inter-protein interactions in a solution, which 
is not dilute, and in a bulk means that these systems 
(both of which we will not consider in this work) con- 
taining many proteins should be distinguished from that 
containing a single protein, as their thermodynamics will 
be very different. 



1. Protein as a Small System 

Our focus in this work is on a single protein (N = 1) 
containing M residues so that Nr — M. As proteins 
are usually small in size, we need to recognize that the 
behavior of a single protein is governed by the thermody- 
namics of a small system (defined as a system in which 
N-r does not grow with the volume V as V — > oo) and 
not of a macroscopic system, such as formed by a bulk (in 
which ATr = NM grows with the volume V); the latter 
will be governed by the thermodynamics of a macroscopic 
system Q • It is well known that predictions of different 
ensembles describing a macroscopic system are the same, 
except at some singular points such as where phase tran- 
sitions occur. Therefore, it is important to understand 
the ways in which different statistical ensembles differ 
from each other for small systems. This is one of the 
important issues motivating this investigation: how to 
distinguish small system thermodynamics from a macro- 
scopic system thermodynamics in various ensembles. For 
this purpose, it is sufficient to consider only two ensem- 
bles: the microcanonical (ME) and the canonical (CE) 
ensembles. 



2. Structures and the Standard Model 

The residue sequence (known as the primary structure) 
in a protein is defined by a gene and is encoded in the 
corresponding genetic code. Understanding the relation- 
ship between the sequence and protein functionality is an 
unsolved problem though major progress has been made 
[H]. A first-principle study of primary, secondary (reg- 
ularly repeating local structures, such as helices and /3- 
sheets) and tertiary (the overall shape or conformations 
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of a single protein) structures requires short (local) and 
long (nonlocal) ranged model energetics that, while re- 
maining independent of protein conformations, temper- 
ature and pressure, determines the native state(s), and 
has to be judiciously chosen to give a unique and correct 
native state @. 

The simplest model that can be used is the standard 
model of Lau and Dill 6j , which classifies the 20 different 
amino acid groups or residues into two subsets, H (hy- 
drophobic residues) and P (hydrophilic/ 'polar residues), 
and allows only nearest-neighbor attractive HH interac- 
tion (whose strength is set equal to 1 in some predeter- 
mined unit) to provide good hydrophobic cores; however, 
consideration of local energetics of the 20 residues Q is 
also common. It is also found that the introduction of 
multi-body interaction enhances cooperativity [8|], and 
should not be neglected. 

The protein in the standard model is an example of a 
copolymer of a prescribed sequence. It is this simplified 
copolymer model and its variants proposed in this work 
that will be the subject of investigation here, even though 
the work can be extended to a more general case. 



B. Energetics and Energy Distribution W(E) of 
Conformations 

1. Microscopic Interaction Energies 

The microscopic energies that appear in the model en- 
ergetics, while determining the thermodynamics, must 
themselves be independent of the thermodynamic state, 
i.e., of protein conformations, temperature, pressure, 
concentration, etc. to be truly microscopic. In addition, 
a proper model should satisfy certain principles Q, one 
of which is the requirement of cooperativity needed for 
the existence of a first-order transition (a latent heat) at 
the folding transition to the native state. The residue se- 
quence plays an important role in determining the native 
state [l(| and, therefore, the thermodynamics. Thus, we 
are driven to treat proteins as semiflexible heteropoly- 
mers with certain specific sequences However, there 
is no consensus for general energetics to describe all pro- 
teins, and there remains a certain amount of freedom in 
the choice for a theoretical investigation. It is widely rec- 
ognized that secondary structures are also important in 
the folding process Q, yet they are not always incorpo- 
rated in determining the energetics. 

In view of the above discussion, it is important, there- 
fore, to investigate the effects of energetics on the behav- 
ior of small proteins, an issue that, to the best of our 
knowledge, has not been studied fully. 

Protein stability and function are the results of exten- 
sive evolutionary changes. In other words, the natural 
evolution has over a long period eventually found the 
most optimal energetics for an individual protein with 
a given sequence to fold fast into its native state. The 
energetics must be tuned to the particular sequence in 



addition to the protein structure] the latter is defined 
as a particular conformation of the protein alone with- 
out any regard to the surrounding environment or the 
sequence. Thus, the study of the structure without ac- 
counting for the environment such as water inside the 
cell or the sequence will not provide a complete under- 
standing of protein thermodynamics. This is because the 
true interactions of a real protein determine the equilib- 
rium structure for a given sequence. For the energetics to 
be truly microscopic, it must also be independent of the 
sequence. This means that not all sequences will form 
natural proteins. 



It has also been argued that conflicts among interac- 
tions also play a significant role in folding fl2| . The in- 
terplay of intra-protein molecular interactions, the inter- 
action with the surrounding, and the residue sequence 
to give rise to the folded native state is quite intricate 
and far from being understood. A complete understand- 
ing will enhance not only our ability to find cures, but 
also to design proteins with a desired behavior. For this, 
we need a true appreciation of the underlying molecu- 
lar interactions and the resulting thermodynamics, not 
specific to a particular folding. This is a key ingredient 
in obtaining a detailed understanding of folding, as the 
energetics determines the energy landscape that presum- 
ably dictates the path to folding. 



As the knowledge of the general energetics that con- 
trols folding in all proteins is an unsolved problem, 
progress can only be made by constructing a model or 
models with a goal to explain some desired or important 
features of the folding process as is common with any 
complex physical system. In general, the model should 
contain various interactions relevant not only for vari- 
ous secondary substructures like helix formation in the 
native state, but also for proteins as semi-flexible het- 
cropolymers. 



For the standard model and its variants that we con- 
sider here, the proteins arc treated as semiflexible copoly- 
mers. The model should also contain solvation effects, as 
all protein activity occurs in the presence of water or sol- 
vent. The compressibility also plays an important role. 
However, as we will discuss later, this makes the prob- 
lem very complicated. Therefore, in this work we only 
consider an incompressible model, and propose such a 
model and investigate its behavior in different limits, one 
of which is the standard model described above. How- 
ever, the central focus of the work remains to be the 
investigation of small system thermodynamics, since pro- 
teins form small systems( M < oo). We will demonstrate 
that the thermodynamics of small proteins differs from 
that of its macroscopic analog in some unexpected but 
substantial ways. 
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2. Energy Distribution W(E) and Small System Thermo- 
dynamics 

Pairwise residue contact energies or potentials are com- 
monly used in theoretical studies of protein folding as an 
important simplification because of the complexity of the 
problem. These potentials are derived from the knowl- 
edge of conformations in the crystal structures of proteins 
in the protein data bank, but the procedure comes with 
serious limitations [5j. One such limitation is the small 
number of conformations that describe the ordered state 
of the protein. A better way would be to use all the con- 
formations W > 1 of the protein [l3| . This requires the 
determination of the distribution W{E) > 1, the number 
of the conformations of a given energy E [14| • 

Once W{E) is known, the complete thermodynamics 
is determined. This is certainly believed to be true for 
macroscopic systems, systems in which the volume of 
the system becomes macroscopically large to suppress 
boundary effects, while keeping the density of participat- 
ing particles such as c fixed in the limit; in mathematical 
terms, the volume must diverge to infinity (thermody- 
namic limit) 

Conjecture 1 We will take the viewpoint that W(E) 
also provides the complete thermodynamics for small sys- 
tems Jj/. 

We will demonstrate, however, that care must be ex- 
ercised as not all that is valid for a macroscopic system 
remains valid for small systems. It should be stressed 
that W(E) depends on the par ticular sequence x °f * ne 
residues, even if W does not [l3| . An interesting question 
arises about the property of self-averaging in heteropoly- 
mers [ill [ljl fl7| ; see Sect. QV] for details. For small 
proteins, there is evidence that certain properties of in- 
terest depend on the sequence \ m important ways [l6j |. 

C. Exact Approach for small Proteins 

Usually, one attempts to determine the distribution 
W{E) by carrying out several simulations. Because of 
the limitations inherent in the simulation, an alternative 
approach is to determine W(E) by exact enumeration 
on a lattice. Such enumerations allow us to do exact cal- 
culations; no approximation has to be made. This has 
the added benefit that we can verify various conjectures 
about the form of entropy, self-averaging, landscape, etc. 
The enumeration is, however, feasible only for short pro- 
teins. The smallest known natural protein (at least to 
us) is Trp-Cage derived from the saliva of Gila monsters. 
It has only 20 residues. Our approach is to consider the 
protein to be a small thermodynamic system containing 
M < oo residues or ammino acids [sj, even if the lattice 
on which it is embedded is infinite. (As discussed later, 
we cut down the number W(E) by rooting the protein by 
fixing one of its end at the origin of the lattice and ex- 
ploiting some symmetry properties.) This approach also 



allows us to investigate how the thermodynamics of small 
proteins differ from that of macroscopic polymers, with 
some unexpected results. In particular, we need to recog- 
nize that small proteins cannot undergo a sharp (i.e., dis- 
continuous or first-order) folding transition. Thus, there 
will, in principle, be no latent heat. One can only look 
for some unambiguous signature of a latent heat (i.e., of 
cooperativity), which can justify a sharp transition in the 
thermodynamic limit of a macroscopic protein. We must 
also consider the effects of residue sequences on the de- 
generacy of the lowest energy state and the nature of any 
possible transition in the thermodynamic limit. 



D. Layout 

The layout of the paper is as follows. In the next sec- 
tion, we provide a discussion of the required thermody- 
namic background to appreciate what may happen differ- 
ently for small systems compared to a macroscopic sys- 
tem. In Sect. Ill, we discuss a very general incompress- 
ible lattice model of a protein of a given sequence. The in- 
comprcssibility brings about certain simplifications as we 
will discuss later. We will only consider a small protein. 
We introduce three models that include the standard 
model and two variants due to weak and strong perturba- 
tions. We consider random, ordered and fixed sequences. 
We consider compact conformations or all conformations 
(compact and non-compact) separately, and label them 
as restricted or unrestricted to distinguish them. In the 
following section, we discuss the issue of self-averaging 
and test it for small proteins. In Sect. V, we study the 
effects of energetics on native conformations. In Sect. VI, 
we introduce small system entropies in the microcanoni- 
cal and canonical ensembles, and discuss various thermo- 
dynamic laws that remain valid for small systems. In the 
following section, we compare the entropies in the two 
ensembles. In Sect. VIII, we study various densities and 
the specific heat. We introduce the notion of a distance 
in Sect. IX and use this to project the multi-dimensional 
configuration space onto a two-dimensional space from 
which we draw some conclusions about the configuration 
space and the landscape. We construct the free energy 
landscape from our numerical results in Sect. X. The last 
section contains a brief summary and discussion of our 
results. 



E. Results 

1. We show that the conformations associated with 
native states of a given fixed energy depend on the 
residue sequence. 

2. Under very mild assumptions, we show that there 
is no energy gap in our model of a macroscopic 
protein; see Sect. MIDI 
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3. The self- aver aging does not seem to occur in small 
proteins, at least for the native state energy, so that 
the sequence x plays an important role; see Sect. 

ED 

4. Different energetics can give the same native state 
(Sect. EI). 

5. For small proteins, the entropy and energy densi- 
ties are not only discrete but also depend on M 
strongly; see Sect. IVI A II In addition, the entropy 
density s(e) is higher for larger M over a wide range 
of energies; see Sect. IVI A 11 

6. Justification for using the Boltzmann entropy and 
the Gibbsian entropy and the partition function 
formalism for small system is given in Sect. IVI El 
We follow this approach in this investigation. 

7. For small systems, we prove that S(E) > S(E) 
where S(T) — S(E) is the canonical or the Gibb- 
sian entropy at T, while S(E) is the Boltzmann 
entropy at the average energy E; see Sect. IVII Al 
For a macroscopically large system, the two en- 
tropies are the same. We also prove that S(E) is a 
concave function, but S(E) is not. 

8. One cannot trust the Gaussian form of the ME en- 
tropy following the random energy model, as it pre- 
dicts a vanishing entropy at an energy above the 
native state, thereby suggesting an energy gap and 
a frozen native state, both of which are not correct 
for a finite protein; see Sect. IVII Dl 

9. The net effect of the perturbations is to make the 
native state more robust to perturbations: Stronger 
the perturbation is, more robust the native state is 
to the perturbation, i.e., it has less excitations. See 
Sect. IVIIIBI 

10. The behavior of the specific heat suggests a discon- 
tinuous folding transition; see Sect. IVIII CI 

11. The two-dimensional projection of the energy land- 
scape C2S is more symmetric than C20; see Sect. 

ED 

12. The energy landscape for the standard model has 
energy barriers in the radial direction for only low- 
lying microstates; see Sect. IIXBI 

13. The energy landscape may not be relevant for fold- 
ing in small proteins; see Sect. IPC El 

14. The thermodynamic relation dS{E)/dE = 1/T for 
the microcanonical entropy S(E) is not valid for 
small proteins; see Sect. IX CI 



II. THERMODYNAMIC BACKGROUND 
A. Configurational Approach on a Lattice 

1. Configurational Partition Function 

In classical statistical mechanics, the canonical parti- 
tion function, the partition function (PF) in the canonical 
ensemble, factors into two independent factors: one fac- 
tor depends only on the kinetic energy, and the second 
factor depends only on the interaction energy, provided 
the interactions do not depend on particle momenta as 
happens with magnetic interactions; see [H[ for a recent 
discussion of this issue. The same is true of other ensem- 
bles; however, we are only going to consider the micro- 
canonical and canonical ensembles in this work We will 
assume here that factorization occurs. This factorization 
establishes a very important aspect of classical statistical 
mechanics: the free energies corresponding to the two fac- 
tors are additive. Thus, one can study them separately. 
Furthermore, since the contribution from the kinetic en- 
ergy is independent of the interactions, it has no bearing 
on studying energetics. Because of this, one needs to fo- 
cus only on the second factor, commonly known as the 
configurational partition function, and totally disregard 
the kinetic energy of the system. This allows us to con- 
sider a lattice model where the focus is on the configura- 
tional partition function, since there is no kinetic energy 
in a lattice model. On a lattice, therefore, the entropy 
refers to the configurational entropy. In the context of a 
single protein investigation, it is commonly known as the 
conformational entropy. The volume V of the system is 
then determined by the number of lattice TVl sites on the 
lattice. We will set the lattice spacing a = 1 in some pre- 
determined unit of the length so that V = N^a 3 = Ni,, 
where a 3 is the lattice cell volume. For general dimension 
d, we have V = N L a d = N L . 

The absence of kinetic energy does not mean that dy- 
namics cannot be studied on a lattice. All one needs to 
do is to introduce some configurational moves to change 
one configuration into another. This is quite common in 
a lattice investigation of any physical model. However, 
we are not interested in studying dynamics in this work. 



2. Most Probable and Average Energies May Not be Same 

The total number of conformations W of a rooted pro- 
tein with a given number M of residues depends only on 
the lattice geometry, the boundary conditions imposed 
on the lattice, and M [HI]. For a small protein, W is 
most certainly finite. It also does not depend on the se- 
quence of the residues fl3j |. regardless of the size of the 
protein, even though W(E) does depend on the sequence 
strongly. This is an important observation, as its im- 
plications are not well appreciated. At sufficiently high 
temperatures, a protein will explore almost all the confor- 
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mations, regardless of the model energetics. It is only at 
lower temperatures that the energetics allow the protein 
to explore only a selected set of conformations W(E) of 
a given average energy E that itself depends on the tem- 
perature. It is a well-known fact that the average energy 
is the energy of the most probable conformations, and 
that the average energy is also the most probable energy. 
If the energetics strongly favors the native state, such as 
in the Go model [l9| . then the majority of the conforma- 
tions are going to resemble the native conformation(s). 
Thus, the number of probed configurations is expected to 
be smaller in such models, which will then provide a very 
efficient way to approach the native state by reducing the 
configurational search [20| . 

3. Twists due to the small size 

However, there are two twists. The above reasoning 
is justified from a thermodynamic point of view only if 
the system is macroscopically large as we have recently 
pointed out [21, 22]. This is not true of a protein, which 
constitutes a small system due to its small size. This 
point will be discussed further below. The other twist 
has to do with the existence of cooperativity or a first- 
order folding transition in such models. Not all energetics 
and/or sequences will give rise to such a folding transition 
to an ordered state. 



B. Small System Discreteness and the Thermody- 
namic Limit 

1. Configurational Space discretization 

It should be stressed that the evaluation of the number 
W, an integer quantity, requires some sort of discretiza- 
tion of the configurational space. In the absence of any 
discretization, the entropy in classical statistical mechan- 
ics will always be infinite due to the continuum nature 
of the space. It is only when we use quantum statistical 
mechanics that the entropy can be properly calculated. 
However, at present, there is no hope of studying a sin- 
gle protein using quantum statistical mechanics, and we 
are forced to confine ourselves to the classical statistical 
mechanics. Thus, a lattice formulation allows us to cal- 
culate the entropy, and not only just the change in the 
entropy [l8j . 

For a lattice model, the configurational energy E is 
going to be discrete in that the difference AE between 
two neighboring energies is going to be a finite, but non- 
zero quantity. In addition, for a small protein, Ae = 
AE/Nn per residue will also remain non-zero; recall that 
for a single protein, ATr = M. Therefore, the energy 
spectrum will be discrete, whether we consider the energy 
E or the energy 

e(JV R ) = E/N R 



per residue. It is only in the limit of an infinitely large 
macroscopic system (JVr — > oo, with the understanding 
that A^l > Ar so that the proteins can be accommodated 
on the lattice) that the energy per residue will give rise 
to a continuum spectrum [23|. In addition, it is in this 
limit that e also becomes independent of ATr [l3[; see 
Fig. Q] later for direct evidence for a single protein case. 
As long as we are dealing with a small protein, we are 
forced to consider a discrete spectrum of e(Nji) or E. 
Consequently, W(E) is a discrete function of E, and as 
said above, e(N R ) continues to depend on Ar [3] for 
finite A/r. 

2. Thermodynamic Limit 

To obtain a proper thermodynamic description which 
is insensitive to the boundary (i.e., surface) effects, we 
need to consider a macroscopically large volume (ATl — > 
oo). This limit by itself does not automatically require 
the limit A/r — > oo, as long as A^l > ATr. The proper 
thermodynamics is obtained formally by taking the ther- 
modynamic limit, which requires considering a macro- 
scopically large volume (V — > oo), such that the residue 
density c (per unit volume) and the energy density e (per 
residue) are either fixed or reach their respective limits 
that are independent of A^r. At this point, we need to 
emphasize that a clear distinction between a single pro- 
tein (finite M) and its bulk counterpart (which we do not 
consider in this work) containing many proteins should 
be made, as their thermodynamics would be very differ- 
ent. The thermodynamic limit for the bulk containing 
a large number of fixed size proteins, each in a given 
sequence %, requires the number of proteins to increase 
with the volume to keep the residue density c fixed. In 
the simultaneous limit N R — > oo, V — > oo, such that the 
limiting densities c > 0, and e, both of which are con- 
tinuous, are kept fixed, E becomes infinitely large, and 
one cannot use it or other extensive quantities (which are 
also infinitely large) to study thermodynamics [231 ] in this 
limit; one must consider corresponding densities, which 
remain bounded. The standard approach is to consider a 
sequence of systems of increasing volume Vk constructed 
so that the resulting sequence of densities {cfc} , {efc} con- 
verge to their respective limiting densities 

{c k } -> c, 
{e k } -> e 

in the thermodynamic limit. This approach is equivalent 
to the following alternative description commonly em- 
ployed in thermodynamics. In this approach, one consid- 
ers finite extensive quantities such as the configurational 
energy E by considering a large but finite size system 
containing N R residues in a finite volume V. The con- 
figurational energy E of the system is almost identical 
to 

E = N R e, A/r<oo. (1) 
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Here e is the energy per residue in the thermodynamic 
limit Nji — > oo as shown above. The accuracy of ([TJ in- 
creases as N-r increases, and ensures that E is in general 
bounded (ATr < oo) and can be approximately treated as 
a continuous variable since AE = MAe = [23|, which 
follows from the fact that e is continuous. This is the 
case, for example, for the random energy model to be 
discussed below. However, even in this approach, one 
formally needs to take the limit as ATr — > oo to properly 
treat e as a continuous variable, but is never done in prac- 
tice as the system under consideration is finite though 
large. Since E and other extensive quantities are now 
approximately treated as continuous variables, though 
they are finite in magnitude, one can carry out thermo- 
dynamic investigation which requires taking derivatives 
of various (continuous) functions. 



3. Single Protein as a Small System 



The limit, however, causes a very serious problem when 
we wish to consider a single protein, which is charac- 
terized by ./Vr = M and \- To maintain a fixed non- 
zero density c, we need to consider the protein size M to 
also increase with the volume. Thus, the thermodynamic 
limit will require M to diverge simultaneously with the 
volume of the system. This also means that the sequence 
X will also change. If it happens that the sequence is rele- 
vant in determining thermodynamics, then we are dealing 
with different proteins as M increases. For example, the 
energy is usually determined not only by M but also by 
the sequence x- The sequence x associated with a protein 
of size M will be different for different M and also from 
that of a protein of an infinite size. The way to avoid this 
problem is to fix both M and x and let the volume diverge 
Q so that the boundary effects become irrelevant. In this 
case, c — > in the limit, but E remains bounded and dis- 
crete. Therefore, in the following, we will consider our 
system to consist of a small protein of size M in a given 
sequence x- However, we let V — > oo, so that our system 
forms a small system in which E remains bounded. The 
same holds for all other extensive quantities [24j in the 
following for our small system. In the rest of the work, 
all extensive quantities must be interpreted in the above 
sense, even though the volume or the size of the lattice 
may be infinite large. Thus, M — > oo is never going to be 
implied in the following whenever we talk about a small 
system. This should cause no confusion. As we will see 
below, the incompressibility condition allows us to take 
the volume infinitely large for any M. 

From now on, we will only consider a single protein 
system, unless specified otherwise. 



C. Energy Landscape, Conformation Space and 
"Distance" between Conformations 

The number W(E) (or W(E)dE for continuous energy 
spectra) also characterizes the potential energy landscape 
for the protein [25|, [26|, [27| , which has become very useful 
for describing the equilibrium properties. Each confor- 
mation of the protein of energy E is represented by a 
point of energy E on the energy landscape. The num- 
ber of such points is precisely W(E) (or W(E)dE for the 
continuum case) and represents the element of the "hy- 
persurface area" of energy E. The entire "hypersurface 
area" of the landscape directly determines the number of 
conformations W [2l|. The native state(s) represents the 
global minimum (minima) of the landscape. The projec- 
tion of the energy landscape in the direction orthogonal 
to the energy axis represents the conformation space C of 
the protein. Each point in the conformation space rep- 
resents a conformation of the protein, and its energy is 
given by the height of the point on the energy landscape 
directly above it in the direction of the energy axis. As 
discussed above, the energy is a discrete variable on a 
lattice, so that W(E), and therefore the entropy are also 
discrete functions [23]. For a macroscopic system, one 
can usually treat both as continuous. But this is not 
possible for a small system. Thus, the concept of the po- 
tential energy landscape must be modified in important 
ways. In particular, the investigation of the landscape re- 
quires knowing the " distance" between conformations in 
the conformation space C. While this distance is trivial 
to define for monomeric systems, this is not so for a poly- 
meric system due to its connectivity. Thus, one of our 
tasks would be to introduce the concept of a " distance" 
between different conformations of a protein. In particu- 
lar, we need to define a "distance" for all conformations 
from the native state or from various native states. The 
notion of a "distance" allows us to partially understand 
why a protein in a given conformation may not fold into 
its native state when its energetics or its sequence has 
been altered due to a disease or some other reasons. 



D. Pathways 

To understand the dynamics of protein folding, we fol- 
low Anfmsen [l[. According to Anfinsen, proteins get 
into their native state following a time-ordered sequence 
of conformations, now called a "pathway". The pathway 
may have a fractal nature [281 ] and is supposed to dictate 
the kinetics of protein folding. Two consecutive confor- 
mations r at time t and T' at the next time t + At in the 
pathway must differ by some local movements, provided 
At is chosen sufficiently small to allow only for some lo- 
cal movements of the protein. Thus, the concept of a 
" distance" between two conformations must be such that 
a small distance between two conformations is consistent 
with allowing a conformation to turn into a " nearby" con- 
formation using only a few local movements. It is easy 
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to be convinced that because of the connectivity of the 
protein, such local movements can most often occur near 
the ends of the protein, but not so often in its interior. 
The movement at an interior point (away from the ends) 
would most often require a large portion of the protein 
from the interior point to the end to participate in a co- 
operative movement. This must require a much longer 
time duration than the smallest time interval At chosen 
above. However, some local internal movements such as 
a reflection along a diagonal of the square cell, is possible 
between nearby conformations. 

Usually, in the folding problem, one is interested in 
following the pathway to the native conformation from 
a nonnative conformation of much higher energy. Thus, 
the entire pathways would correspond to an eventual low- 
ering of the energy. However, there is no guarantee that 
r' will always be of a lower energy than T. There is 
also no guarantee that T' will be closer (in distance) to 
the native state than 1\ The only constraint is that T 
and r' are close in distance. It is possible that two con- 
formations are closer in energy but have much different 
distances from the native state. Thus, the "distance" and 
energy are going to be independent. The pathway most 
certainly will include non-native contacts, which disap- 
pear as the protein gets into its native state. It will also 
depend crucially on various energies in the model, since 
the energetics uniquely govern the partitioning of W into 
a distribution W(E) of the number of conformations of 
energy E on the energy landscape: 

W = J2 W ( E ) > L ( 2 ) 

E 

Different energetics will usually lead to different path- 
ways. Thus, it is possible to extract information about 
energetics from a knowledge of pathways. 

A pathway will contain conformations that are not all 
going to be compact, so the aqueous interactions will also 
play an important role in determining the pathway along 
with other bonded and non-bonded interactions. As the 
relative strengths of various interactions change, so do 
the partitioning of W in the distribution W(E): wiffer- 
ent models will assign different energies to various con- 
formations with the result that different conformations 
contribute to W(E). 

E. Random Energy Model of a Macroscopic Sys- 
tem and Concavity of its Entropy 

1. Random Energy model 



where A, a, and E are constants (3l|. In general, A 
depends exponentially and a inversely on the size M of 
the protein: 

In A oc M, a oc 1/M. (4) 

This ensures that W(E) grows exponentially with M. It 
is easy to envision situations, however, in which one can 
obtain non-Gaussian distributions with unusual proper- 
ties, not commonly associated with such a distribution. 
In particular, some distributions would be completely ir- 
relevant for proteins. Hopefully, some energetics will al- 
low the model protein to behave like a real protein. The 
current investigation is a first step towards identifying 
such realistic energetics. 

2. Entropy Concavity 

The configurational entropy in the random energy 
model, following the Boltzmann relation 

S(E) = \nW(E), (5) 

is given by 

S{E) = \nA-a{E- E) 2 ; (6) 

see ([3]) ; both terms in ^ are extensive. The form of this 
entropy is an inverted parabola so that it is concave [30j | . 
Mathematically, this requires 

d 2 S/dE 2 < (7) 

for a macroscopic system to ensure its thermodynamic 
stability. Observe that E is where the entropy has its 
maximum. It should be noted that the number of states 
W(E) in vanishes at the extremes of the allowed en- 
ergies [31j| . In these neighborhoods, S(E) becomes nega- 
tive. To avoid a negative S(E), one uses the above form 
over the range 

(E — a, E + a), a = \/ln A/a, 

where a is extensive so that S(E) is non-negative over 
this range, and supplements it by S(E) = outside this 
range. In the following, we will only focus on the low 
energy range. 



A common distribution is the Gaussian distribution 
of the random energy model [2^ | , which has been exten- 
sively employed for proteins (see [27l | for example), and 
which will be discussed later in the work. In this model, 
W(E) is given by the following continuous function of 
the continuous variable E 



W(E) = Aexp —a(E — E) 



(3) 



3. Energy Gap 

The supplementary function S{E) = requires mak- 
ing the assumption that the lowest allowed energy Eq in 
the energy spectrum is below the lower end of the above 
range: 

E < E G ee E - a. 
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This assumption gives rise to an energy gap between Eq 
and Eq, the width of the gap itself being extensive. The 
presence of the energy gap makes the modified entropy 
function convex in the region about Eq. It is this modi- 
fied form of the random energy model that has been ex- 
tensively used in studying protein folding; the resulting 
concavity violation around Eq is interpreted as a folding 
transition, as we will show below. The folding tempera- 
ture Tp is given by the inverse of the slope of the tangent 
drawn from Eq so that it touches the entropy function 
©; see [27j for example. The modified Gaussian model 
also shows that the energy gap above the ground state is 
crucial for foldability. It should be noted, however, that 
there are idealized physical models, such as the KDP 
model, that freeze into the ground state through a first- 
order transition at a finite non-zero temperature [32[ |33| , 
something similar to the protein folding. 

It is well known that the energy gap in the KDP model 
is extensive in size just as in the random energy model. 
It is this extensive size of the gap that makes the macro- 
scopic entropy non-concave in the neighborhood of the 
gap in the random energy model. 

The temperature at which S(E) vanishes represents 
the ideal glass transition temperature Tq. The ideal glass 
is a frozen state of zero entropy and exists below this 
temperature and has a constant energy Eq > Eq and 
zero specific heat. 

4. Equality of S(E) and S{T) 

The Gaussian form ((6]) of the entropy has been used to 
suggest the following form of the average energy E [34j : 

E = E-l/2aT (8) 

above the folding temperature. As we will see later, this 
form of the energy can be justified for a macroscopic sys- 
tem. This form cannot apply near absolute zero where 
it becomes unbounded, and the problem is avoided by a 
folding transition. As a sharp folding transition cannot 
occur in small proteins, it is also desirable to understand 
the limitation of the above energy form for small proteins. 

The Hclmholtz free energy F(T) is obtained by evalu- 
ating F(T) = E- TS(E), and is given by 

F(T) =E-T\n.A- 1/iaT, (9) 
from which S(T) can be obtained directly, see below (TTij)) : 
S(T) = In A- 1/AaT 2 , (10) 

so that 

S(E) = S(T), (11) 

see ©, as said above. The ideal glass temperature is 
given by 

T G = \/2aa. 



This equality is only valid for a macroscopic system 
and, as shown recently [13] and will also be discussed fur- 
ther in this work, does not hold for small systems such as 
a finite protein that is of our interest here. Their equal- 
ity, however, is crucial as direct experimental approaches 
(such as crystallography or NMR techniques) are used to 
provide information about the typical conformations as- 
sociated with the average or most probable energy. Thus, 
it is also important to know if the two concepts of entropy 
are equivalent for small proteins. If not true, the inter- 
pretation of experimental data for the energetics would 
be incorrect. This will become a limitation of any direct 
experimental technique in determining the energetics and 
its association with conformations. 



5. Limitations of the Model 

The random energy model can be justified for a macro- 
scopic system by appealing to the central limit theorem 
and assuming that various energies are random variables. 
Accordingly, this model is not applicable to small pro- 
teins. Therefore, it is far from obvious how relevant the 
random energy model is for small proteins. Moreover, 
there are other limitations of the model in addition to 
those noted in [3l[ . One of the problems with the ran- 
dom energy model becomes evident from its free energy 
([9]), which does not reduce to E at absolute zero as re- 
quired by thermodynamics. Note that the free energy 
continues to satisfy the condition of stability everywhere 

d 2 F/0T 2 < 0, 

which follows from the non- negativity of the specific heat. 
Therefore, the above thermodynamic violation is not a 
consequence of any thermodynamic instability. The vi- 
olation has to do with its unphysical entropy in pop , 
which does not satisfy the thermodynamic requirement 
TS{T) -> as T -> To avoid the above violation, 
a first-order folding transition is invoked at T = T-p given 
by 

F(T F ) = E . 

Above Tp, one uses the free energy ©, and below Tp 
one uses F(T) = Eq. The folding transition is in reality a 
freezing transition in that the low-temperature phase is a 
frozen state of zero specific heat, similar to the ideal glass, 
except that the ideal glass has a much higher energy Eq 
due to the energy gap discussed above. It should be clear 
that Ep = E(Tp) > Eq. However, it should at the same 
time be stressed that the energy gap is not present in the 
random energy model, but has been put in "by hand" 
to avoid a negative S(E). This energy gap then makes 
the entropy S(E) non-concave, which is then responsible 
for the first-order folding transition. If there were no 
energy gap, i.e. if Eq > Eq, then there would be no 
loss of concavity. In that case, there would be no folding 
transition. However, the condition Eq > Eq would make 
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the model quite unphysical as no equilibrium state would 
exist in the model below a non-zero temperature at which 
E = Eq, but the entropy is not zero. 

It should be noted that the random energy model it- 
self does not specify the value of Eq. Indeed, ([3]) is valid 
for all E > — oo. This suggests that Eq — > — oo . If 
this is accepted, then the tangent construction to locate 
the folding temperature will give Tp — > oo. This is not 
meaningful. For a meaningful discussion, we need the 
following conjecture. 

Conjecture 2 We need to treat Eq as finite. 

This should not come as a surprise. Indeed, it follows 
from our earlier discussion of the energy in ([T|) . We need 
to apply the random energy model to a finite but large 
system so that Eq can be treated as finite. 

At the same time, a physical requirement for W(E) 
is that for allowed energies, W(E) > 1. If this is taken 
literally [3l|, then ([3]) must be restricted to the energies 
in the range (E — a, E + a), so that the lowest allowed 
energy is Eq — Eq. In this case, there will not be any 
energy gap, and no loss of concavity. This is usually not 
the interpretation adopted in the literature. Invariably, 
one adopts the conventional choice Eq < Eq, the actual 
value of Eq itself being irrelevant, as long as it is taken 
to be finite. But this is merely a convention, which then 
justifies the folding transition in the model. 

It should also be noted that an energy gap is not the 
only mechanism by which a first-order transition and an 
ideal glass transition can occur. Both can occur without 
an energy gap as we will discuss below. Here it is suffi- 
cient to note that all one needs is a lack of concavity in 
the entropy for a folding transition. 

F. Small System Microcanonical and Canonical 
Entropies 

1. Microcanonical Entropy and Energy Landscape 

The microcanonical entropy is given by the Boltzmann 
relation ([5]), and has played a very important role in our 
attempts to understand the way folding occurs into com- 
pact native states along a very large number of micro- 
scopic pathways that connect a native state to myriad 
unfolded conformations. This entropy definition is useful 
when the system (such as a protein) is forms an isolated 
system so that its energy remains fixed, along with A/r, 
and V. The system occupies each of the various confor- 
mations r 6 T(E), all of energy E, with equal probability 

p(T) = 1/W(E). (12) 

Here, T(E) represents the set of conformations, each 
of energy E (for given jVr , and V, which we do not 
show below for simplicity), and contains W(E) distinct 
conformations. The corresponding ensemble containing 
these conformations is called the microcanonical ensem- 
ble (ME). 



Conjecture 3 The ME entropy via |5|) can most cer- 
tainly be defined even for a small system such as a pro- 
tein. 

This makes the Boltzmann entropy {5J a very useful 
quantity to study for proteins. There is an additional 
significance of this entropy or of the number W(E), as 
noted earlier. The number W(E) also characterizes the 
potential energy landscape for the protein (2f| [26|, [27| • 

It is a well-established tenant of macroscopic thermo- 
dynamics that in the physically relevant range of the en- 
ergy W(E) decreases with falling energy E so that 

dS/dE > 0; (13) 

consequently, the energy landscape for a macroscopic sys- 
tem in the physically relevant range of the energy is ex- 
pected to possess a structure that narrows down with 
falling energy. An example of such a landscape could be 
a funnel such as the surface of an inverted hyper-cone (a 
cone in a high-dimension space). The hypersurface area 
of such a cone at height E — Eq in a p dimensional space 
is proportional to [E — Eq) p ~ 2 , which satisfies the prop- 
erty (TT3"|) . Whether this property is also a characteristic 
of a landscape associated with a small system remains to 
be investigated. This is one of the aims of this work. It 
should be noted that the " energy landscape" for a lattice 
model will be discrete and not a continuous hypersurface 
0. 

Remark 4 Property \1S\) should be interpreted not as a 
differential property, but merely implying that S(E) de- 
creases with E for the discrete case. 

In the following, all differential relations will have such 
an interpretation for the discrete case, if applicable. 

It is known that the entire thermodynamics is con- 
tained in S(E), which is supposed to be concave (3(ij 
for a macroscopic system. Its violation is a signature of 
a phase transition in the model. Whether this concav- 
ity is also a characteristic of a small system ME entropy 
remains to be investigated. 

In view of the above discussion, it is important, there- 
fore, to investigate the form of S(E) and the effects of 
energetics on it for small proteins, which to the best of 
our knowledge has not been studied fully. 

2. Canonical Entropy 

The direct experimental approaches (primarily, crys- 
tallography) used to determine energetics in proteins at a 
given temperature T provide information about the con- 
formations associated with the average energy E at T. 
In this work, T is always going to represent the tempera- 
ture in the units of the Boltzmann constant. The protein 
is no longer isolated, but interacts with its environment 
at a given temperature T so that the energy can be ex- 
changed but Nr, and V still remain fixed. The system 
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now requires the canonical ensemble (CI 
modynamic description. Thus, one neec 
dependence of the canonical entropy S(l 
age energy E at a given temperature T . 
given by the Gibbsian relation 

S(T) = -Yj>(T)lnp(T), 

where p(T) is the probability to be in the < 
and is controlled by the energetics and the 
of the system; we have suppressed the lat 
in p(T) for notational simplicity. It is als 
the conventional entropy in the canonical 
by 

S(T) ee —dF(T)/dT, 

as we will show later; here F(T) is the 
energy, the thermodynamic potential in th 
semble. 

It is important to appreciate the sign 
form of the Gibbsian definition (TH| . It ( 
plied to the equilibrium microcanonical en 
case, p(T) is independent of T, and is gi 1 
is easily seen that the Gibbsian entropy, ; 
is exactly the same as the Boltzmann enl 
is true regardless of the size of the syst 
will take the Gibbsian definition (fT4|) to b 
systems of any size. 

For a macroscopic system, S(T) giver 
formulation is identical to the Boltzmam 
at the average or the most probable en 
temperature T; see in the random er 
an example. The general equality (TiTj) a 
late the energetics with configurational properties: me 
canonical entropy at T provides information about the 
conformations of average energy E. 

• Warning: There should be no confusion in distin- 
guishing S{T) and S(E), as their arguments will 
always be exhibited. This is important to note as 
we will show that the two quantities are very dif- 
ferent for small systems. 



III. MODEL 

A. Rooted or Anchored Protein 

A proper model for protein folding will require using 
scminexibility of the protein, for which we will use a re- 
cent model developed in our group [36| ■ It is the semiflex- 
ibility which gives rise to a crystalline phase; the latter 
represents the ordered native state of the protein at low 
temperatures. Therefore, we will treat a protein as a 
semiflexible self-avoiding copolymer chain on a lattice to 
study its folding by properly extending the above model 
[36| . The lattice is taken to be infinitely large (A L — > oo) 




Helical Turn 



FIG. 1: A 2-d model of a finite protein on a square lat- 
tice. The red spheres represent hydrophobic sites and the 
blue spheres represent hydrophilic sites. 



so that the protein will never feel the effects of its bound- 
ary. Each amino acid residue (including any side group) 
is represented by a tiny sphere, which must lie on a lat- 
tice site; see Fig. Q] Each solvent also occupies a lattice 
site. We will consider an incompressible model so that no 
voids are allowed. A site is either occupied by a residue or 
by a solvent. The self-avoidance condition means that a 
lattice site cannot be occupied by more than one residue 
or a solvent. We consider a two-state model [f| [37} in 
which each amino acid is classified either as a hydropho- 
bic site (red spheres in Fig. [T] and denoted by H) or a 
hydrophilic /polar site (blue spheres in Fig. [T] and de- 
noted by P). Due to the chemical structure of an amino 
acid, a protein is directional. One end of the protein has 
a free carboxyl group and is known as the C-terminus 
or carboxyl terminus. The other end of the protein has 
a free amino group and is known as the N-terminus or 
amino terminus. Proteins are always biosynthesized from 
the N-terminus to the C-terminus. On the other hand, 
most chemically synthesized proteins grow from the C- 
terminus to the N-terminus. Thus, a proper model should 
account for this directionality. Accordingly, in this work, 
we will incorporate the directionality of the protein, and 
treat both ends as dissimilar. This condition can always 
be relaxed without much complication. Treating both 
ends dissimilar basically doubles the number of distinct 
conformations of the protein, without any useful impli- 
cation for the way the entropy behaves. 
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1. Compact and Unrestricted Protein Conformations 

In our enumeration, we only consider a square lattic 
in this work. We will consider a protein to have eithe 
no restriction on its allowed conformations, or restrict i 
to only take a compact form, which we take to be rect 
angular. In the former case, the protein will be allowei 
to take all shapes including compact shapes by havini 
it probe all allowed sites on an infinite lattice. In th 
second case, the protein will be restricted to have onl; 
compact shapes so that there are no solvent molecules ii 
its interior; the surrounding of a compact region will b 
occupied by the solvent, i.e., water. The compact confor 
mations are also present in the former unrestricted case 
We will say that the conformations are unrestricted ii 
the former case and compact in the latter case. In botl 
cases, the end of the protein is rooted and is not allowei 
to move. There is a simple reason for rooting or an 
choring the protein. The process of folding in vivo oftei 
begins co-translationally, so that the N-terminus of th 
protein begins to fold while the C-terminal portion of th 
protein is still being synthesized by the ribosome. Thus 
it is the C-terminus that we root or anchor at the origin 
and allow the N-terminus to be free to begin folding. 

To generate compact rectangular shapes, we allow al 
possible rectangular shapes that could accommodate i 
given protein of size M. We give an example to clarify 
this point. Consider M = 24. For this case, we consider 
the following rectangular shapes in two dimensions: 1 x 
24, 2 x 12, 3 x 8, and 4x6. We do not need to separately 
consider 24 x 1, 12 x 2, 8 x 3, and 6x4 because of the 
rotational symmetry. 

The anchoring has three important consequences for 
our computation. In the first place, this reduces the 
number of conformations that need to be counted. On 
an infinite lattice, an unanchored protein can start from 
any of the infinite lattice sites, making W infinitely large. 
This trivial infinity due to nonanchoring has no bearing 
on thermodynamics. In the second place, anchoring al- 
lows us to uniquely define the distance between two con- 
formations as we will discuss below. From now on, we 
will always root our protein at one of its ends on the 
lattice. In addition, we will also restrict the protein con- 
formations so that its first bond from the root is along a 
fixed direction, which we take to be to the right, to limit 
the number of conformations. In order to further reduce 
the number of distinct conformations, we also restrict 
the first bend, as we start from the root, to be in the 
down direction of the square lattice. It is easily seen that 
any other conformation of the protein is related to one 
of the generated conformations by some trivial rotation. 
The last consequence of rooting is the following. There 
will be no doubling of conformations due to directionality 
that was discussed above. 

The number of conformations W for rooted proteins 
increases rapidly with the protein size, as is seen in Fig. 
[SJ The number of conformations W for rooted proteins 
increases rapidly with the protein size, as is seen in Fig[2] 
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FIG. 2: The rapid growth of W (shown in the common log 
scale) with M for an unrestricted protein on an infinite lattice. 
The allowed conformations are grown as described in the text. 



below. The growth of W for the rooted protein with its 
first bond in a specified direction on an infinite lattice 
can be fitted by 

W = 0.102272 exp(0.990933Af), 

with R 2 = 0.999876 (3^]. Correspondingly, the time re- 
quired to generate all the conformations W (but no other 
computation such as their energies, distances, etc.) also 
increases rapidly with the size M as the following Table 
IIII A II shows. The time reported here is on a PC. The 
time obviously increases if other computations are also 
carried out. 



Table Hn A II — Size and Computation Time on a PC 



M 


Finite 


Infinite 


16 


1 s 


10 s 


18 


1 s 


2 min 


20 


1 s 


1 hour 


24 


1 s 


3 days 


26 


1 s 


5 days 


36 


10 s 




49 


45 min 




64 


5 weeks 





B. Microscopic Interaction Energies 

To account for the presence of water surrounding the 
protein, water molecules (to be denoted by W) are also 
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allowed in the model. Each water molecule occupies a site 
of the lattice. To incorporate compressibility, voids can 
also be incorporated in the model. In that case, each void 
will be allowed to occupy a site of the lattice. We now 
turn to the complications induced by the compressibility. 



1. Simplification Resulting from Incompressibility 

Each conformation of the protein on the lattice results 
in certain sites of the lattice being occupied by the pro- 
tein. In the incompressible model, rest of the sites will be 
occupied by the solvent. Thus, each conformation of the 
protein is associated with only one possible distribution 
of the solvent molecules on the lattice. Accordingly, there 
exists one and only one microstate of the system (the lat- 
tice containing the rooted protein) for each conformation 
of the protein. In other words, the number of possible 
microstates of the entire system is the total number of 
conformations W of the rooted protein. It should be 
stressed that for sufficiently large volume V or com- 
pared to M, the number of conformations W will depend 
only on M but not on V or N-^. This is a major simpli- 
fication. The Gibbsian definition (TT?)) of the entropy of 
the system refers to the sum over the microstates of the 
system. This means that the sum in ([TJJ for the system 
is nothing but the sum over the conformations belonging 
to W. 

This simplification is lost if we consider a compressible 
model containing voids. Then, there will be many more 
possible distributions of the solvent for each conforma- 
tion of the protein. Let k denote one of the microstates 
of the system, and fc(T) the set of microstates that are 
associated with a conformation T of the protein. The set 
k(T) depends not only on M as above, but now it also 
depends on Al and iVo,the number of voids, even if 
is sufficiently large. This is very different from the situ- 
ation above for the incompressible limit. The entropy of 
the system is now given by the Gibbsian definition 

S(T) = -Y^Pk^Pk, (16) 

where pk is the probability of the fcth microstate. This 
entropy can be reexpressed as follows: 

r fcefe(r) 

The number of microstates of the system which deter- 
mine the sum in the Gibbsian definition p6[) will far ex- 
ceed the sum W of protein conformations. This will make 
the computation much more extensive, depending on the 
amount of free volume (i.e. of the voids): larger the free 
volume, more extensive the computation. Because of this 
complication, we only deal with the incompressible model 
in this work. 



2. Equal Size Approximation for Residues and Solvent 

We do not allow voids in the present work, and take 
the solvent (water) molecule and the residue each to oc- 
cupy a lattice site. This is an approximation as the water 
molecule and the residue do not have the same size. In 
a more realistic model, the water molecule and a residue 
may be allowed to occupy more than one lattice sites, de- 
pending on their relative size. While we can incorporate 
size difference in our lattice model, it makes the calcula- 
tion harder. To avoid this, we adopt the simplification of 
equal size in this work. 



3. Interaction Energies 

The excluded-volume effects are accounted by enforc- 
ing that a lattice site cannot be occupied by more than 
one residue or water molecule. The interaction ener- 
gies are restricted between chemically unbonded parti- 
cles (residues H and P, and water molecules W) that are 
nearest neighbors of each other. Long range interactions 
are neglected, but can be incorporated later if so desired. 
We will not do that here. There are three species of par- 
ticles (H,P, and W) in our model. As shown elsewhere 
[39l ] , we need to only consider three independent energies 
of interaction between three chemically unbonded pairs 
of species. We have decided to use the following three 
van der Waals energies enH, eHW, and epn between the 
three unbonded pairs HH, HW, and PH. In the stan- 
dard model due to Lau and Dill, only the first one in 
non-zero, as shown in Table IIIIB 31 To account for the 
scmiflexibility of the protein, we use the model recently 
developed by us to study crystallization and glass transi- 
tion in polymers [36| , but extend it to include preference 
of helical formation. The original model has a penalty 
eb > for making a bend, an attractive energy ep < 
between two parallel protein bonds, an attractive energy 
ehp < for a hairpin turn (on top of the penalty for two 
consecutive bends in the same circulation direction), and 
an attractive energy eu < for a helical turn (on top of 
the energy for four bends and two hairpin turns). 

We consider a protein with M residues in a given se- 
quence x of H and P associated with the residues on a 
square lattice, with one of its end fixed at the origin so 
that the total number of conformations W for a small 
protein remains finite even on an infinite lattice. We 
only consider the case in which the number of H and P 
are equal. This can be considered as the condition of 
charge neutrality. We generalize a recent model [36| . in 
which the number of bends Nb, pairs of parallel bonds 
A p , and hairpin turns Ah p characterize the semiflexibil- 
ity; see FigfTJ where we show a protein in its compact 
form so that all the solvent molecules (W) such as water 
are expelled from the inside and surround the protein. 
The dark spheres denote hydrophobic residues (H) and 
light spheres denote hydrophilic (i.e., polar) residues (P). 
The nearest-neighbor distinct pairs PP, HH, HP, PW and 
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HW between the residues and the water are also shown, 
but not the contact WW. Only three out of these six 
contacts are independent on the lattice [3^ |. which we 
take to be HH, HW, and HP pairs. A bend is where 
the protein deviates from its collinear path. Each hair- 
pin turn requires two consecutive bends in the same di- 
rection (clockwise or counterclockwise); see Fig. [U Two 
parallel bonds form a pair when they are one lattice spac- 
ing apart. We also use the number of helical turns Ay. 
On a square lattice, a "helical turn" is interpreted as 
two consecutive hairpin turns in opposite directions as 
shown in Fig. [T] The corresponding energies are eb, 
ep, Ch P , and en, respectively. The interaction energies 
are eHH = — 1, chw, and enp, corresponding to the HH, 
HW, and HP, respectively. The number of these pairs 
are Ahh; ftw, and Ahp, respectively. We let e'denote 
the set containing all {e^} , except enH = — lj and e the 
entire set {e^}, where i stands for b,p,hp,hl,HH,HW, and 
HP. Thus, e,e' represent the sets 

e = {eb, ep, eh P , ehi, enH, 6hw, epn} , 
e' = {eb, ep, eh P , em, eHW, epn} . 

Similarly, N = N(r) = {A,(r)} denotes the set 

N = {A b , A P , A hp , N hh A HH , A hw , A ph } , 

and N' denotes all {A^} , except Ahh : 

N' = {A b , A P , A hp , A M , A HW , A ph } . 

Let W(N) denote the number of protein configurations 
on a lattice of size Al > M. The energy of the configu- 
ration r corresponding to the set N is given by 

E(N) =e-N=^e l A i . (17) 

i 

The energy varies from configuration to configuration as 
it depends on N. But it does not depend on thermody- 
namic state parameters such as the temperature, pres- 
sure, etc. 

The dimensionless entropy function corresponding to 
configurations with a given N is defined as 

S(N) = lnW(N). (18) 

(This definition amounts to setting the Boltzmann con- 
stant equal to 1.) There will in general be many sets 
N that will result in the same energy E. We denote the 
collection of these sets by 'N(E). Thus, the number of 
configurations W(E) for a given E is obtained by sum- 
ming W(N) over this collection N(_E) : 

W(E) = ^( N )- (19) 

NeN(E) 

The corresponding entropy function for a given E is 
given, as usual, by ([5]). The total number of all pro- 
tein configurations, regardless of the energy E, is given 
by©. 



C. Various Model Energetics Choices 

The three choices we have most often made for energies 
are described below in the form of three different models, 
the parameters for which are shown in Table [Til B 31 



Tabic HH B 31 — Possible Models and their parameters 





Standard (A) 


Weakly (Bi) 


Strongly (Ci) 


Bend 





1/50 


1/3 


Parallel 





-1/50 


-1/3 


Hairpin 





-2/50 


-1/3 


Helix 





-2/50 


-1/3 


HH 


-1 


-50/50 


-3/3 


HW 





20/50 


2/3 


PH 





5/50 


1/3 



1. Model (A) 

In the standard model, the set N contains only one 
quantity, the HH contact number Ahh- Thus, e' = 0, 
and the adimensional energy in this model is simply given 
by E — Ahh- As Ahh is going to be an integer, the 
corresponding density 

n HH = Ahh/M 

is going to be a discrete quantity, so will be the adimen- 
sional energy density e = E/M = ?ihh- The number of 
conformations VF(Ahh) of a given Ahh is 

M/(A HH ) = EVF(A HH , N'). (20) 

In the standard model, E = Ahh- It is clear from ([20]) 
that the entropy 5(Ahh) = In W(Ahh) for a given Ahh, 
regardless of N', is maximum in the standard model 
[2lL l40l | . This feature of the standard model entropy is a 
possible justification of the observation made in [8[. As 
a consequence, the protein with a given Ahh will probe 
many more states in the standard model than in any 
other model, which then slows down its approach to the 
native state. Thus, it is important to have non-zero e' 
to step up the approach to the native state. (It is highly 
likely that the native states in different models are dif- 
ferent, but this does not affect the above conclusion, pro- 
vided the native states are unique.) There is another 
important consequences of having the remaining Si — 0. 
The fluctuations in the corresponding Ni are maximum 
as there is no penalty no matter what N' is. Hence, the 
protein will spend a lot of time probing a large number of 
conformations so as to maximize fluctuations in N'. This 
also suggests that we need to go beyond the standard 
model to describe proteins that fold fast. Correspond- 
ingly, the entropy per residue is also discrete, with two 
successive values differing in the argument by 1/M. In 
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other words, for small proteins, the entropy per residue 
s(e) is not a continuous function, but a set of discrete 
values, as shown in Figsf4] and [5l It is clear from the 
figure that one can easily draw a concave envelop for the 
discrete values of s(e). However, one can also draw a vari- 
ety of other envelop functions that would not necessarily 
be concave such as those shown by the lines joining these 
points in the figures. 

2. Weakly Perturbed Model (B U B 2 ) 

In this model, we allow for other energies to be non- 
zero, but still small compared in strength. The model 
with the parameters in the above table will be called Bi 
in the following. Another common choice we have made 
is e' = (3/56, -1/56, -3/56, -3/56, 21/56, 5/56), and the 
corresponding model will be called B2 in the following. 
The two models collectively will be simply denoted by 
B. The numerator of various energies are integers and 
are used to determine the energy E as an integer, which 
makes it easy to classify energy levels in groups of a given 
energy. The energy is divided by the denominator at the 
end to ensure that eHH = — 1. The energy corresponding 
to a HW-contact is the only energy close to |chh| ; this 
is to account for the strong repulsion between H and W. 
Otherwise, all other energies are extremely small com- 
pared to I eHH I ■ Consequently, this model will be identi- 
fied as a model with weak perturbation on the standard 
model. 

The model B2 can also be treated as a model with small 
perturbations on the model Bi (or vice versa) in which 
each residue is allowed to move about within the small 
cell surrounding the lattice site on which it is located. 
Such a disturbance will usually cause a small perturba- 
tion of Bi (or vice versa) and can be described by the 
model B. 



3. Strongly Perturbed Model (Ci,C 2 ) 

In this model, we allow for other energies to be not 
only non-zero, but also comparable in strength to e = 1. 
The most common choice we have made is the one shown 
in the Table |IIIB3| e' = ( b, -b, -b, -6, 2b, b), b = l/3(~ 
1). We will call this the model Ci. Again, the numerators 
for various energies are integers for the reason explained 
above. Another model called C2 has only one non-zero 
element = 1 in e'. Both models will be collectively 
denoted simply by C. 

The model A is the standard model. In the model B, 
we have most other interactions much weaker than |chh|, 
while they are comparable to |chh| in the model C. Thus, 
the model B is closer to the model A than to the model 
C is. Despite this, we will see that the models B and 
C behave very different from A. It should be noted that 
W does not depend on the model; it is its partition into 
W(E) that depends on the model. Thus, the shape of 



the energy landscape changes from model to model, but 
not its total " area" which is given by W [2l[ . 

D. Absence of Energy Gap 

1. Semiflexible Homopolymers and Absence of Energy 
Gap 

The semiflexibility of homopolymers has been ex- 
ploited by Flory to explain crystallinity by using a very 
simple model, which contained only the bending penalty 
(4l| . The energy was simply given by 

-Etlory = ebA^b- 

No other interaction such as with the solvent was consid- 
ered. Thus, the lowest energy is l?Fiory = 0. At absolute 
zero, the polymer chains are going to be all straight with 
no bends (provided the chains are finite in length). Thus, 
it is anticipated that they would give rise to an ordered 
structure. One possibility is that of an aligned configura- 
tion in which all chains are parallel to each other, though 
this is by no means the only configuration as one can 
envision many other configurations of the same energy 
-E-Fiory = 0. The aligned configuration was considered by 
Flory to represent the crystalline state formed by linear 
polymers. Thus, it is expected that the above simple 
model will give rise to a melting transition from a dis- 
ordered liquid state to a crystalline state at a melting 
temperature Tm- 

To make connection with our protein model, we will 
henceforth consider the limiting case of a single macro- 
scopically large semiflexible homopolymer chain. The 
original approximate solution due to Flory indeed shows 
such a melting transition at a non-zero melting temper- 
ature Tm- The approximation used by Flory gives rise 
to an energy gap, which is deduced by the observation 
that the resulting entropy based on the approximation 
becomes negative over the gap, similar to what happens 
in the random energy model discussed earlier in Sect. 
Ill El Over the gap, the entropy is replaced by S(E) = 0; 
we will use E instead of -Epiory in the following for con- 
venience. This gap then makes the entropy non-concave 
and results in a melting transition in the model. The 
transition turns out to be a freezing transition in that 
the entropy of the frozen state (the crystal) remains zero 
below the melting temperature, just as was the case for 
the random energy model. 

It was later shown by Gujrati and coworkers [52] that 
there was no energy gap in the Flory model of semiflexi- 
ble homopolymers. A macroscopic chain with no solvent 
was considered. For the infinitely long polymer chain in 
the absence of any solvent, the problem is also known 
as the Hamilton walk problem, the problem in which the 
walk visits all sites once and only once. The demonstra- 
tion of the absence of an energy gap was achieved by 
demonstrating that the entropy was never negative over 
the entire energy range in the model. The demonstration 
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itself was done by obtaining a rigorous lower bound to the 
entropy S(E). This required an explicit construction in 
which local excitations, the Gujrati- Goldstein excitations 
(GG excitations) which are pairs of oppositely oriented 
hairpin turns, populate the crystal. One such excitation 
is shown in Fig. [1] for the case of no solvent in the in- 
terior. It is the local excitation represented by the two 
hairpin turns where the parallel bond pair is shown in 
the figure: it is a "bound" pair of oppositely oriented 
hairpin turns and represents a GG excitation. These GG 
excitations should be distinguished from unpaired hair- 
pin turns. The unpaired hairpin turns either cannot be 
moved, or can be moved only by changing the number of 
bends or of parallel bonds or by introducing voids; see 
the hairpin turn in the second row (from the top) just 
above the shown HP pair in Fig. [TJ it cannot be moved 
up or down without increasing the number of bends or 
of parallel bonds or by introducing voids. In contrast 
to these, the bound GG excitations are highly "mobile" 
in that they can be moved about without changing the 
number of bends or of parallel bonds or by introducing 
voids until they hit another defect or the wall; see the 
excitation between the third and fourth row (from the 
top) in Fig. [T] which can be freely moved to the left. 
This " agility" of the excitation increases the entropy in 
the system without changing the energy in the model. 
It should be noted, see Fig. [TJ that an isolated hairpin 
turn can be turned into a GG excitation by increasing 
the number of bends by 4 and parallel bonds by 2, after 
which the excitation becomes " agile" to move. 

The distances over which the GG excitations can be 
moved can be easily estimated in a crude fashion by the 
defect density. This is similar to the interparticle distance 
between particles at a given concentration c, which is 
given by c™ 1 ^, where d is the dimension of the lattice. 
We can use for c the density Cd of the defects (the bends, 
hairpin turns or the GG excitations) in the crystal. Thus, 
the number of possible moves for a single GG excitation 
is this distance and is on an average 

W GG ^e- 1/d /a = c- 1/d , (21) 

as we have set a = 1. At T = 0, we surely have a = 0. 
The GG excitations along with other defects like the 
bends, the hairpin turns, etc. gradually populate and be- 
gin to destroy the perfect crystalline order by increasing 
the entropy as soon the temperature rises above T = 0, 
and the crystalline phase melts at the melting (or unfold- 
ing) temperature Tm into a disordered phase 36]. The 
crystalline state has been shown to occur via a sharp 
first-order transition if we have either an infinitely long 
macroscopic polymer [42j or a bulk system containing a 
macroscopic number of finite length polymers [36j pro- 
vided we allow other energies besides that for bending. 
As long as we have a single polymer, which is finite in 
length, the folding transition is not going to be sharp, 
but diffuse. 



2. Semiflexible Copolymer and Absence of Energy Gap 

The constructive proof of no energy gap also works for 
the current protein model, as we now discuss. The main 
difference is that while the calculation discussed above 
for the homopolymer is done rigorously, we do not have a 
rigorous calculation at present for the copolymer because 
of the complexity produced by the sequence structure. 
Our results are based on plausibility arguments, which 
we present below. As said earlier, the issue of an energy 
gap in proteins requires studying macroscopic proteins. 
We, therefore, consider a single macroscopic protein. We 
will also not consider any solvent, so that we are dealing 
with a Hamilton walk problem. Accordingly, M = Nz,, 
and c = 1/a = 1. As we have just seen, the presence of 
the Gujrati-Goldstein excitations in a homopolymer im- 
plies that there is no e nerg y gap in our model of melting 
for a homopolymer .'Ui. ,42j . We now extend the construc- 
tive proof to the copolymer case (or to the heteropoly- 
mer case). The complication arises from the presence of 
other interactions, such as the HH interaction. Let us 
for the moment only consider the bending penalty and 
the hairpin and parallel bond energies in addition to the 
contact interaction energy due to the HH pair contacts. 
Thus, we consider the variant models B and C and not 
the standard model in the following. We will return to 
the standard model later. 

Consider a macroscopically large copolymer of a given 
sequence x on a lattice. Let us consider the native state 
at T = 0. The attractive HH interaction and a favorable 
(negative) energy for a hairpin turn compete with the 
bending penalty in order to minimize the internal energy 
in the native state. In contrast, one only need to maxi- 
mize the HH contact number without any regard to the 
number of bends in the standard model, and to only min- 
imize the number of bends in the Flory model without 
any regards to the HH contacts. We will assume that 
there is only one unique native state (modulo any sym- 
metry operation). For example, for M = 24, we show the 
native state for the model Bi in (f3"2"|) , which is related to 
the native state in Q34p by a symmetry transformation 
(f30| as explained later. This does not prove but strongly 
suggests a unique native state even for larger M. 

Because of the favorable nature of hairpin turns, the 
native state must have a non-zero density of them. Thus, 
the defect density Cd would be non-zero at T = 0, which 
makes this problem inherently different from that of the 
semiflexible homopolymer. Some of the hairpin turns 
must be in the bound state in the form of the GG ex- 
citations. We assume that there is a non-zero density 
cqg of these excitations in the native state at T = 0. 
The native state will usually have the maximum number 
of the HH contacts for most of the sequences \ as e HH 
has the maximum strength. If we move a GG excita- 
tion, this will require a rearrangement on the lattice of 
that portion of the protein that is contained between the 
two hairpin turns of the excitation under investigation. 
We can crudely estimate the number of residues on this 
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portion of the protein as 



IV. SELF- AVERAGING AND SMALL PROTEINS 



n R ~ M/c d V = l/c d a d = l/c d . 

Half of this number is the average number of H residues 
in this portion. 

The positions on the lattice of the residues belonging 
to this portion of the protein will change with the move- 
ment of the GG excitation. Even though this movement 
does not change the number of bends and parallel bonds, 
it will invariably reduce the number of HH contacts com- 
pared to that in the native state. Thus, the energy of the 
deformed conformation due to the GG excitation move- 
ment will be higher than that of the native state. Indeed, 
this is true of any deformation of the native conforma- 
tion (including that generated by the movements of the 
GG excitations): Any deformation of the native state will 
always raise the energy since by definition, the unique 
native state has the lowest energy (at T = 0). For the 
deformation due to the GG excitation movement, this 
increase is due to breaking some of the HH contacts. 

Not much can be said about how much the increase in 
the energy will happen in displacing a GG excitation, as 
it depends strongly on the sequence \ an d on the topol- 
ogy of the native state. Furthermore, not all newly gen- 
erated conformations in Wqg will have the same excess 
energy. We now pick an extensively large number of GG 
excitations and move each of them, which results in Wqg 
new conformations. The new H^gg is the product of Wqg 
in [21] over the set of selected GG excitations in the con- 
struction. The resulting gain in the entropy density will 
be 

As ~ (n GG /d) lnc d , 

where uqg is the density of GG excitations used in the 
construction. We expect uqg to be proportional to the 
defect density Cd, at least for small Cd, so that the above 
entropy gain vanishes as Cd — > 0. 

Let Wgg(E) denote the number of conformations in 
the above construction to have the energy E, where E > 
Eq, Eq being the energy of the native state. Obviously, 

W GG = J2 W gg(E), 

E 

where the sum is over possible energies that appear in the 
construction due to the movement of the excitation. For 
a macroscopic system, the sum is going to be dominated 
by some energy E = E > Eq, so that 

W GG - W GG (E). 

But a little reflection will convince the reader that the 
excess energy density e — eo is also proportional to tigg- 
Thus, we will obtain a continuous energy density spec- 
trum in our construction. As the construction only gen- 
erates some of the conformations of energy E = E, the 
actual entropy gain is at least as much as As > given 
above. Consequently, it does not seem possible to have 
an energy gap for most of the sequences. 



For a system with quenched randomness, which in our 
case is created by the fixed sequence of amino acids, an 
important question about self-averaging has been probed. 
The idea is quite simple. Consider a protein with M 
amino acids in a given sequence X- The sequence for a 
given protein is fixed in Nature (or in the lab, where 
it is synthesized). However, there are several possible 
sequences. For example, consider all possible sequences 
for any given M in which there are exactly s H-typc 
residues and (AT — s) P-type residues. The number of 
possible distinct sequences is given by 

r - m 
Cm ' s = s\{M-s)V 

On the other hand, if we consider all possible sequences 
without any restrictions on the number of H-residues, 
then the number of possible sequences is 2 M correspond- 
ing to all possible values of s. The most probable value 
of s is s — [M/2] , where [x] is the integer part of x, 
since Cm,[m/2] is maximum. Let us denote the set of 
corresponding sequences by \. Loosely speaking, we will 
call these sequences the most probable sequences, know- 
ing well that it is the value of s or the corresponding set 
X that is most probable and not one of the sequences. 

Let Q denote a certain thermodynamic property like 
the energy of the native state, the free energy of the pro- 
tein, the number of helices in the native state, etc. This 
quantity will, in general, depend on the sequence x, and 
one can determine its quenched average 

<Q >scq =±-X2Q(x), (22) 

\x\ x 

where |x| is the number of possible sequences over which 
the averaging is done. The property Q is said to be self- 
averaging if 

km Q( X )= lim < Q > seq (23) 

M — *oo M— »oo 

for almost all x- As usually happens in the thermody- 
namic limit, x contains almost all the sequences. This is 
evident from the behavior of Cm,s for large M. The most 
probable sequence contains Cm,[m/2] — 2 M for M >> 1. 
This is also the number of all sequences. Then, the above 
condition of self averaging really refers to any sequence 
belonging to X- It is clear that the idea of self-averaging, 
which is not a trivial property, requires considering a 
macroscopic copolymer. If the property is self averaging, 
then the limit on the left in ([23|) is independent of the 
sequence x- This important property then gives rise to 
many simplifications. For example, it allows one to use 
the replica trick [43] to calculate the quenched averages of 
quantities such as the free energy. The trick represents a 
major technical advantage that has been extensively used 
quite successfully to study macroscopic random systems. 
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FIG. 3: The scaled distribution of W(eo) as a function of 
the native state energy eo for 10,000 different sequences for 
unrestricted coformations of M — 24. For the standard and 
the weakly perturbed models, we show the scaled distribution 
W(eo)/5 and 10VK(eo) so that the scaled distributions can be 
shown on the same scale. 



As shown in [13], there are strong indications that self 
averaging is valid for macroscopic proteins. 

It is instructive now to see how well the equality (12"3")) 
(without the limits on both sides) is obeyed for finite M. 
For this purpose, we consider the native state energy Eq 
to be the the thermodynamic property Q, and consider 
the quenched average of the native state energy density 
e = E /M: 

< e Q >so q = <E > scq = jjj^ 

over all sequences that belong to Xi so that the average 
is taken over all sequences with the restriction of equal 
H and P (even M). Thus, not all sequences are allowed. 
This is done because of the importance of the most prob- 
able sequence noted above and requires evaluating E (x) 
for each sequence in X- 

Let Wn(co) denote the number of times a given native 
energy eo = Eq/M appears among all sequences in X- 
We then calculate the relative root mean square (rms) 
fluctuation 

lfl i V< e Q >scq -(< 6p > SGq ) 2 . . 

( de 0)soq = 1 i , (24) 

|< eo > SC q| 

where 

<el > 8eq -^E^o(x). 

Standard arguments [43| show that the relative fluctua- 
tion (£eo) scq should decrease as l/vM for large M: 

(5e Q ) scq ex 1/VM. (25) 



We have done the calculations for the three models for 
M = 16, and M = 24 on an infinite lattice. For M = 16, 
we have considered all the sequences in x, each with 
equal number of H and P residues. The total number of 
these restricted sequences is 

016,8 = 12,870. 

For M = 24, we have only considered 10, 000 differ- 
ent sequences for the three different classes of energet- 
ics, which is a small fraction of all allowed sequences 
£24,12 = 2,704,156. We only show the distribution for 
M = 24 in Fig.©. The results for various quenched av- 
erages and the relative fluctuations are summarized in 
Table HVl 



Table ITVl — Quenched averages and relative fluctuation 





Model 


< eo >scq 




< Se > seq 


M = 16 


Model A 
Model B 2 
Model Ci 


-0.3208 
-0.1959 
-0.1107 


0.1058 
0.0427 
0.0174 


0.1674 
0.3344 
0.6448 


M = 24 


Model A 
Model Bi 
Model Ci 


-0.2927 
-0.1720 
-0.1336 


0.0867 
0.0314 
0.0212 


0.1079 
0.2440 
0.4320 



We see that the relative fluctuation increases as 
the strength of the perturbation increases for both 
sizes. In addition, it appears that the relative in- 
crease (0.4320/0.1079 = 3.8519 for M = 16) or 
(0.6448/0.1674= 4.0037 for M = 24) does not apprecia- 
bly change with the size. This needs to be investigated 
further for other sizes. Moreover, the relative fluctuation 
is not small, implying that the spread of the distribution 
IFN(eo) is not insignificant. If we calculate \f~M (5eo) seq 
from Table IIVI we observe that this product is much 
smaller for M = 24 than for M — 16, while accord- 
ing to |25|) . this product should not change. There are 
two possibilities for this behavior. It is quite conceiv- 
able that either M = 24 is not large enough for (f2"5"]) to 
be observed or that the choice of only 10, 000 sequences 
for M = 24 does not give a good estimate of the relative 
fluctuation (Seo) scq . Thus, our results may not be reliable 
enough to prove or disprove self-averaging for a macro- 
scopic protein. Nevertheless, the results in Table HVl for 
the small proteins that we have considered in the present 
work clearly show that the average native state energy 
< eo > S cq, though highly probable, does not represent 
the native state energy of almost most of the random se- 
quences in X- There is no reason to believe that other 
thermodynamic quantities will have their sequence aver- 
age equal the average of any randomly selected sequence. 
Thus, small proteins are not self-averaging. This is con- 
sistent with the accepted result in the literature, see for 
example, [l6l |. that sequences play an important role in 
small proteins. 
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The situation in Fig. ([3]) does raise an interesting ques- 
tion. We see the most probable native energy is far 
from the lowest native energy for each of the three mod- 
els. Does Nature prefer to design proteins whose native 
energies are close to the most probable native energies 
or to the lowest native energy? It should be remarked 
that all those sequences that have their native energies 
close to the most probable native energy do not fold into 
one unique native structure, though many sequences are 
found to have the same structure (conformations without 
any regard to the sequence). The native conformations, 
though compact, have varied structures. 

For the standard model in Fig. ([3]), we observe that 
there are eight different native energies for the 10 4 ran- 
dom sequences for the standard model. We find that 
cq ~ —0.3 is the most common native state energy; all 
these native states differ only in their sequences. None of 
the models ascribes a unique structure of the native con- 
formation to a particular sequence. However, the stan- 
dard model does point to an interesting fact. The number 
of sequences with the lowest native energy e — —0.38 is 
an extremely small fraction of the 10 4 sequences consid- 
ered here. It should be remarked that the sequence xo 
described below in [26] gives a much lower native state en- 
ergy eo = —0.4167, and is not part of the 10 4 sequences 
whose results are shown in Fig.Q. For M — 16, there 
are seven different native energies between eo — —0.44 
and eo = for the standard model. The most dominant 
native energy is eo — —0.31 given by 5664 sequences, but 
the number of sequences with the lowest energy (430) is 
not as small a fraction as for M = 24. Thus, it appears 
that the fraction of sequences among all sequences that 
gives the lowest possible native energy is small, this frac- 
tion becoming smaller as the protein size increases. This 
suggests that the most probable native energy distribu- 
tion becomes narrower with the size M. This observation, 
which seems to support the emergence of self-averaging 
for M — > oo, needs to be checked further. 

For the weakly perturbed model, the same distribution 
Wk(eo), see Fig. (j3)), exhibits a clear band structure; the 
number of bands seems to be clearly controlled by the 
number of possible energies in the corresponding stan- 
dard model. However, the band structure is "smoothed 
out" for the strongly perturbed model because the lat- 
ter does not allow as many native state energies as the 
weakly perturbed model. Because of this difference in 
the allowed native state energies, the maximum Wn(co) 
for the weakly perturbed model is much smaller than the 
maximum Wn(co) for the strongly perturbed model. 

We have found that in the majority of cases that we 
have investigated, the following sequence containing a 
repetition of PHHP and which we denote by xo 

Xo : (PHHP)„ (26) 

gives rise to the lowest energy or very close to it. Because 
of this, we mostly present results based on this particu- 
lar sequence xo m this work, though we have considered 
other sequences also. 



V. ENERGETICS AND NATIVE CONFORMA- 
TIONS 

Let us fix M — 24 and consider unrestricted conforma- 
tions. The sequence is fixed to xoj i-e. to 

PHHPPHHPPHHPPHHPPHHPPHHP 

for the reason explained in the preceding section. For 
the standard model, there are 30 native states, all of 
the same energy density eo = —0.4167, as discussed in 
the following. One of the native states is the following 
conformation: 



IP 


2H 


3H 


4P 


8P 


7H 


6H 


5P 


9P 


10H 


11H 


12P 


16P 


15H 


14H 


13P ' 


17P 


18H 


19H 


20P 


24P 


23H 


22H 


21P 



(27) 



and can be represented by the string 

RRRDLLLDRRRDLLLDRRRDLLL, (28) 

which is read from the left and refers to the sequential 
steps from the first residue along the right (R), lcft(L), 
up (U), and down (D) directions. The first step is always 
to the right direction, and the first bend is always in the 
D direction. This is done to cut down the number of con- 
formations to be counted. All conformations in which the 
first bend is in the U direction is topologically identical 
to one of the conformations that we generate. Similarly, 
conformations that start not in the R direction are also 
topologically not distinct. Despite these restrictions, we 
still duplicate some conformations if the two ends of the 
protein are treated identically. This happens when the 
last step of the protein is in the L direction and the bend 
before the last step is in the U direction. We will ex- 
plicitly demonstrate this below. However, this does not 
affect us as we deal the two ends as different. 

We also report the nine other native states that are 
given by the strings 

RRRDLLLDRRRDLLLDRRRDLLD, 
RRRDLLLDRRRDLLLDRDDRUUR, 

RRRDLLLDRRRDLDRDLLLURUL , 
RRRDLLLDRDLDRDDRUURULUR, 
RRRDLDRDLDRDLLLURULURUL, (29) 
RRDLURDRURRULLULDLULDLL, 
RRDLURDRURRULLULDLULDLU, 
RDLDRRRULURURRRULLLURRR, 
RDLDRRRULURURRRULLLURRU. 

We notice that the third and the eighth strings above are 
related by 

L R,U D, and the reversal of the strings; (30) 
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an example is given below for clarity. Thus, there are 
only 9 distinct native states if the two ends are treated 
identically. Of course, the above symmetry transforma- 
tion does not affect our calculation since we make a dis- 
tinction between the N-terminus and the C-terminus. 

For the weakly perturbed model Bi, there are two na- 
tive states of energy density eo = —0.3717. The native 
state string 

RRRDLDRDLDRDLLLURULURUL (31) 
represents the following native state 



IP 


2H 


3H 


4P 


24P 


23H 


6H 


5P 


21P 


22H 


7H 


8P 


20P 


19H 


10H 


9P 


17P 


18H 


11H 


12P 


16P 


15H 


14H 


13P 



(32) 



The other native state string 

RDLDRDLDRRRULURULURULLL (33) 
represents the native state 



24P 


23H 


22H 


21P 


IP 


2H 


19H 


20P 


4P 


3H 


18H 


17P 


5P 


6H 


15H 


16P 


8P 


7H 


14H 


13P 


9P 


10H 


11H 


12P 



(34) 



This native state is topologically identical to the previ- 
ous native state, and is described by the string obtained 
by the symmetry transformation (|30[) . as noted above if 
the two ends are identical. It is clear that the weak per- 
turbation alone has drastically reduced the native state 
multiplicity from 30 to 2. This shows the importance of 
even the weak perturbation. 

The strongly perturbed model, surprisingly, has three 
native states given in (|2T|) . (f3"2"|) . and (f3~4")k the last two 
are related to each other by the above transforma- 
tion. This suggests that the relationship between a 
given native state and the energetics is quite complex. 
The set N for the first two native conformations are 
(10,15,5,0,10,4,0), and (18,12,9,7,10,4,0); the third 
native conformation has the same N as the second one 
above, which should not come as a surprise. The en- 
ergy density of each of the three native conformations 
is (—32/72). If we use eh P = —2/3 = eh, then only the 
last two conformations survive as the native conforma- 
tions; the first one is no longer a native conformation. 
Now, the native energy density is (—48/72), and N is 
(18,12,9,7,10,4,0), the same as for the previous set of 
energetics. This is a clear demonstration of the fact that 



the same native state can occur in various different mod- 
els. Therefore, one cannot determine effectively the en- 
ergetics of a protein by only studying the native states. 
For this, one must also investigate many of the non-native 
conformations . 



VI. SMALL SYSTEM THERMODYNAMICS 
A. Microcanonical Entropy 

1. Equilibrium 

The dimensionless ME entropy corresponding to con- 
figurations with a given energy E is given by the Boltz- 
mann relation (O; as above, we have set the Boltzmann 
constant equal to 1. This entropy is relevant if the en- 
ergy of the protein is held fixed. Keeping E constant is 
not the same as keeping each term ejiVj in the sum in 
(|17p constant; the latter can change as long as the sum 
in (I17p remains constant. We define the equilibrium to 
mean that the protein explores all possible conformations 
included in W(E) with equal probability given in (| 12[) . 

Let us recall the arbitrary positive energy e (we can 
take this to be the magnitude |chh| for concreteness) 
that we have used to introduce the adimensional en- 
ergy E, which is really E/e For a small protein 
(M < oo), each element in the set N is finite. Thus, the 
adimensional energy is also finite, with the closest spac- 
ing A m [ n E between two successive values of E at least 
|e m in| (which is really |e m i n | /e), where e m i n is the element 
with the smallest magnitude in the set e. Therefore, for 
small proteins, E is a discrete variable. The correspond- 
ing energy density per residue 

e ee E/M 

is also discrete and becomes continuous only when M — > 
oo. Thus, as long as M is finite, the energy and the en- 
tropy density per residue 

s(e) ee S(E)/M 

remain discrete. In addition, they also depend on M for 
small proteins 13]. To show this most clearly, we repro- 
duce s(e) for the strongly perturbed model Ci in Fig. 0] 
for M = 16,24,32,40, and 48; we restrict the conforma- 
tions of the protein to be compact. There continues to 
be a dependence on M, even though the largest value of 
M is 48. We also note that the discrete nature of the 
energy and entropy persists. There is a clear evidence of 
many local maxima in the entropy, each maximum sur- 
rounded by many energies of lower entropy forming an 
energy band. These bands are well separated by gaps in 
the energy, at least near the low end of the energy even 
for M = 48. It is surprising to observe the erratic form 
of the entropy in that the bands are highly irregular in 
shape, at least near the low energy end. The entropy 
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-0.2 0.0 0.2 0.4 



FIG. 5: The behavior of s(e) for the weakly perturbed model 
Bi on an infinite lattice as a function of the protein size M. 
We observe that s(e) for the larger size contains that for the 
smaller size inside it. 



function is becoming somewhat smoother (but still dis- 
crete) near it global maximum because the energy levels 
are becoming denser in this range. 

In Fig. [5l we show the ME entropy s(e) when the pro- 
tein conformations are unrestricted. We are considering a 
weakly perturbed model Bi. We again see a dependence 
of s(e) on M, as before. Similarly, the allowed energy 
densities continue to depend on M. This dependence is 
not so weak to be negligible, especially near the low en- 
ergy range, the range more appropriate and influential 
in studying protein folding. This is a clear indication 



that one cannot treat the densities such as s, and e to 
be independent of M. This point does not seem to be 
appreciated in the literature; see for example, [34|]. We 
notice that s(e) remains discrete even for M = 24, close 
to the largest protein we have investigated in the case 
when the conformations are unrestricted. 

There are some common features in both figures [4] and 
[5] The first feature is the presence of gaps in bands of 
s(e) at lower energies: there is a clear energy gap between 
the two lowest bands for M ranging from 24 to 48 for 
compact conformations and for M ranging from 20 to 24 
for unrestricted conformations. The gap decreases with 
M in both cases. This is consistent with the claim in Sect. 
IIIIDI of no energy gap in the model. Another feature we 
notice is that s(e) is usually higher for larger M over 
a wide range of energies. There is a certain pattern in 
the undulations present in s(e) : they seem to form a 
band structure with several peaks within each band; the 
number of peaks in a band keeps increasing with M. The 
presence of these bands will be explained below. 

The native state of the protein is, by definition, the 
lowest energy state at absolute zero. Depending on the 
interactions in the protein and the sequence \ of H and P 
residues in it, the native state may or may not be unique. 
In the latter case, the multiplicity of the lowest energy 
state will indicate that the protein functionality is not 
simply determined by the native state. (We will call this 
multiplicity the degeneracy of the native state.) The way 
out of this dilemma is to have the energetics tuned in such 
a way that the native state becomes unique. At present, 
our understanding of protein functionality is not so com- 
plete to answer this question unambiguously. Therefore, 
we will allow the occurrence of degenerate native states 
and study the effect of energetics on this degeneracy to 
learn how the energetics should be tuned to give a unique 
native state. It may be that there exist high energy bar- 
riers between these native states so that it is impossible 
for the protein to jump from one native state to another 
in a finite amount of time. However, it should be rec- 
ognized that for a small protein (M < oo), no energy 
barrier of any kind except due to excluded volume in- 
teractions (which occur when a site is occupied twice, 
but do not exist in our lattice model as only configura- 
tion satisfying excluded volume constraints arc allowed) 
can be infinitely large; hence, the time required to trans- 
form from one native state to another will remain finite, 
though it may be large in some cases. Thus, this idea of 
a large barrier to explain the robustness of a protein may 
not be so reliable or relevant. 



2. N on- equilibrium 

Away from equilibrium, the protein will not explore 
all the conformations in W(E) with equal probability. 
In this case, the entropy of the non-equilibrium state is 
given by the Gibbsian relation (|14p in which p(T), where 
F is one of the conformations in W(E), is independent of 



20 



the temperature. This non-equilibrium microcanonical 
Gibbsian entropy will eventually achieve its maximum 
under the constraint 



Ep(r)^i, 



(35) 



as the protein equilibrates. This is easily seen by the using 
the Lagrange multiplier trick to maximize the combina- 
tion 



£>(r)(-inp(r) + A), 



where A is the Lagrange multiplier. The resulting distri- 
bution is given by 

p(T) = exp(A - 1). 



The use of (|35|) determines the Lagrange multiplier 

exp(A- 1) = l/W{E). (36) 

Thus, the resulting equilibrium distribution is given by 
the Boltzmann relation ([5|) . This is the conventional law 
of increase of entropy in thermodynamics as the system 
moves towards equilibrium. 

This formulation of the second law is obviously applica- 
ble to small systems such as our protein in our approach 
based on the Conjecture [31 and also justifies our Conjec- 
ture [1] 



energy level of the standard model turns into a band; see 
the bands of the perturbed models in Fig. [51 We see 
that there are exactly 11 bands, equal in number to the 
11 energy levels in the standard model A. The energy gap 
between the bands at the low end of the energy spectrum 
in the weakly perturbed model is also a manifestation of 
the energy gap in the standard model. This gap is easy 
to notice in Fig. [5] where we have also shown the entropy 
at low energies for the model Bi. This gap seems to be 
almost filled up in the strongly perturbed model C; see 
Fig. M 

As M increases, the energy spectrum in e becomes 
dense so that e and, therefore, s(e), become continuous. 



C. Canonical Partition Function 

A protein in Nature is not a closed system as discussed 
above. Therefore, the ME is not the most suitable en- 
semble to investigate. As the protein interacts with its 
surrounding at a given temperature T, we need to con- 
sider the CE in which the temperature of the system and 
its surrounding is held fixed. This description is more 
realistic and can be characterized by the canonical par- 
tition function given by 



Z(T) = Y,W(E)eM~f3E), 



(37) 



B. Behavior of the Compact and Unrestricted 
Conformations 

The behavior of S(E) is different compact and un- 
restricted conformations. We first consider the stan- 
dard model. For unrestricted conformations, the max- 
imum energy corresponds to non-compact conformations 
of which there are many; the actual value depends on 
the value of M. Thus, S(E) does not vanish at the up- 
per end of the energy. Here, the entropy continues to 
increase as the energy increases. This can be easily seen 
in Fig. [5] On the other hand, the situation is drastically 
different for compact conformations. Here, there are not 
that many configurations of the highest energy. Thus, 
the entropy first rises and then drops as the energy in- 
creases. This remains true for any of the three models, 
and we refer the reader to Fig. [4] where we have shown 
the results for compact conformations in the model Ci. 

Let us consider unrestricted conformations of the pro- 
tein. The standard model entropy will be perturbed dras- 
tically even with weak perturbation of energies. This is 
because the number of conformations that contribute to 
W{N- SH _^ aax ) at the highest energy E x = -A H H,max in 
the standard model will redistribute themselves in a band 
due to weak energy perturbation. The spread of the band 
will now give zero or very small entropy at the highest 
energy in the two perturbed models. This causes a dras- 
tic change in the form of the entropy distribution: each 



where (3=1 /T is the inverse temperature in the units of 
the Boltzmann constant. The reader should be warned 
that we are using the partition function formalism, which 
is believed to give the correct thermodynamics of large 
systems, for the current case of a small protein. The 
thermodynamics of a small system is far from a complete 
understanding in that it is not known if the small system 
thermodynamics is the same as that predicted by the use 
of the above partition function ((37)) . We will not be con- 
cerned with this issue here and adopt the most prevalent 
view in the field and use the above small-system partition 
function formalism to study the thermodynamics of the 
small system. A credible justification of this adoption 
will be provided at the end of the next section. 

It is convenient to rewrite the partition function as a 
sum over N as follows: 

Z(T) = ^2w(N)exp[-pE(N)}. 

N 

^From this, we can calculate the thermodynamic averages 
Ni as follows: 



Ni 



E N AW(N)exp[-/?ff(N)] 
Z{T) 



^ lnZ(T) )' 

where the derivative is taken at fixed /3e£, where e[ rep- 
resents the set of all the remaining energies in the set e 
except ej, and may be a null set. If we introduce the 
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lnZ(T) 



fluctuation AN = Ni - N, then 
It follows, therefore, that 

'dWi 



and 



fdNi 



> 0. (39) 



< 0. 



(40) 



As said above, the derivative is taken at fixed S3e[. 



D. Canonical Averages, Fluctuations, and En- 
tropy 

1. Equilibrium 

We define the system to be in equilibrium, when the 
canonical probability distribution for T is given by 



p(T) = e-P E W/z(T) 



(41) 



where the partition function is given in (|37|) . which can 
also be written as a sum over T : 



Z(T) = J2e 
r 



-/3E(T) 



(42) 



One can also introduce the probability for the system to 
have a given energy E : 



p{E) = W(E)e- f)E(T) /Z(T) 



It is clear that 



EPF) EE EE 1. 

r e 



(43) 



(44) 



The canonical probability distribution p(T) can be 
used to directly evaluate the thermodynamic average (to 
be denoted by an overbar in the following) of any ther- 
modynamically extensive quantity [24| 0(T) using 



O ee J20(T)p(T). 
r 



(45) 



Similarly, we can use p(E) to directly evaluate the ther- 
modynamic average (again to be denoted by an overbar in 
the following) of any thermodynamically extensive quan- 
tity O(E) using 



O = EO(E)p(E). 

E 



(46) 



Both averages are functions of the temperature T. Two 
of the examples of such averages are N(T), and E 
v ■ N(T); see (l38l) . It is easy to see that 







E = -[^nZ(T)), 



(AEY 



d_ 

"dp 



\nZ(T) 



BE 
~df3 



> 0, 



(47) 



where AE = E — E is the energy fluctuation. Thus, E 
is a monotonic increasing function of T. 

Let Eq and E\ denote the minimum and maximum 
allowed energies in the model, and E the energy at which 
S(E) has its maximum. At absolute zero (T = 0), it is 
easy to see that E(0) = Eq. At infinite temperatures, 



E(oo) 



w 



and can be very different from E due to the finite size. 
(Their equality occurs only for a macroscopic system.) 
Consider M — 48, Model Ci, and all its conformations 
in the compact form. There are 1, 194, 244 distinct con- 
formations, and the exact calculation provides 

e(oo) = 0.0521, and e = 0.0625, 

where the energy density per residue e(oo) = E(po)/M 
and e = E /M. The energy density per residue eo = 
E /M = -0.5764, and e 1 = E l /M = 0.3750. The num- 
ber of conformations of energy E is 38, 707, so that the 
entropy density per residue is s(e) = 0.2201. The two 
energies e(oo) and e are very different. One can also ob- 
tain e(oo) > e. Nevertheless, E monotonically increases 
with T from E(0) to E(oo). This does not guarantee that 
each Ni also increases monotonically with T (except in 
the trivial case of the when the set N has a single mem- 
ber such as the standard model). Indeed, some of them 
may actually decrease with T . 

It is convenient to introduce various densities associ- 
ated with average extensive quantities of interest by div- 
ing by M : 

e ee E/M, m ee Nl/M. 

It is these densities that will approach a limit as M be- 
comes larger and larger [3]; see Figs. [4] and 03 For finite 
M, they remain functions of M. 



2. N on- equilibrium 

If the system is not in equilibrium, then the canonical 
probability distribution is not given by (|41|). However, 
the entropy of the non-equilibrium state is still given 
by (|14p . where p(T) is the non-equilibrium probability 
distribution; it will also depend on T. This distribution 
should be used to calculate configuration averages by us- 
ing (|45[) . As the system approaches towards equilibrium, 
p(F) changes so as to maximize the entropy under two 
constraints, one of which is the above constraint (|35[) . 
The other one is the constraint on the constancy of the 
average energy 



5>(r)J5(r) = E = constant, 
r 



(48) 
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Again, using two Lagrange multipliers A and 7, and max- 
imizing the combination 

5>(r)hinp(r) + A + 7 £(r)], 

r 

we find that the resulting probability distribution is given 
by 

p(T) =exp[A- l+7^(r)]. 

This distribution can be used in (fT4|) to find the cor- 
responding entropy. Comparing this entropy with the 
relation ([51]) below, we conclude that the two Lagrange 
multipliers are 

7 = -P, 

and 

exp(A - 1) = 1/Z(T); (49) 

consequently, the equilibrium probability distribution is 
given by given by (|4ip . as expected. 

^From now on, we only carry out equilibrium calcula- 
tions. 

E. Justification of Using (|37|) for Smali Systems 

The free energy in the canonical ensemble is the 
Hclmholtz free energy 

F(T) = -T\nZ{T), (50) 

from which we can also obtain the canonical entropy S(T) 
by using (|15[) . This entropy satisfies the conventional 
thermodynamic relation 

S(T) = (3 \E(T) - F(T)] (51) 

as can be easily verified by using ([50)1 in ([T5|) . ^From 
this, we find that (at constant extensive quantities such 
as the "lattice volume", numbers of residues, etc.) 

d~E = TdS + SdT + dF = TdS, (52) 

which is the first law of thermodynamics now valid for a 
small system. 

Let us compare the canonical entropy in (j!5[) with the 
S(T) given by the Gibbsian relation (Q3|. We find that 

S(T) = £ \J3E(T) + In Z(T)]p(T) = [3 \E(T) - F(T)] , 
r 

and is identical with the canonical entropy above in (j!5|) . 
The two ways of calculating the canonical entropy give 
the same result even for a small system. In other words, 
the Gibbsian relation (jT4j) is also valid for a small system. 
This is a justification of adopting the partition function 
formalism for small systems, as discussed in the previous 
section. 



VII. SMALL SYSTEM MICROCANONICAL 
AND CANONICAL ENTROPIES 

A. S(E) > S(E) 

It should be stressed that one must always use the 
probability of a conformation (usually called a microstate 
in statistical mechanics) p(T) in the Gibbsian relation 
(fT4")l . In other words, one cannot group these microstates 
and use the probabilities of the groups. We will demon- 
strate this by an example, let us group the microstates 
of a given energy together and use the probability p(E) 
to construct the combination 

E = -J2p(E)hip(E), (53) 

E 

which looks similar to the combination in the Gibbsian 
relation (fl"4"|) . It is easily seen that 

£ = S(T)-S(T), (54) 

where 

S(T) = ES(E)p(E) (55) 

E 

is the thermodynamic average entropy, so that £ does 
not give S(T). Moreover, since £ is, in general, not zero, 
S(T) in (fT5")) or (TT4"]) is not the same as the thermody- 
namic average entropy S(T) in (|5"B"|) . Thus, the concept 
of microstates (or conformations in the context of pro- 
teins) is crucial in using the Gibbsian relation ([TJJ to 
obtain the canonical entropy. 

An important consequence of (|53[) is the following. 
Since < p(E) < 1, it is evident that E > 0. Hence, 

S(T)>S(T). (56) 

^From (|55j) . we conclude that S(T) > 0, since it is an 
average of a non- negative quantity S(E). Thus, 

S(T) > 0. 

This then proves that the free energy F(T) is a monoton- 
ically decreasing function of T even for a small system. 

In the thermodynamic limit [M — > 00), E will ap- 
proach zero from above, as the sum in (|53p is replaced 
by a single term corresponding to E = E(T), for which 
p( E) = l. Thus, S(T) approaches S(T) from above. 

Both S and E are continuous function (except possi- 
bly at a phase transition, which is not relevant here as we 
are dealing with a finite protein) of the continuous vari- 
able T. We now wish to express the canonical entropy 
S{T) as a function of the average energy E. To do so, we 
recognize that the derivative dE/dT is non- negative; see 
P?)) . Thus, it can be inverted to express T as a function 
T(e), where e = E/M. This allows us to express S(T) as 
an explicit function Sj~E) = S [T(e)] of E. (S(E) should 
not be confused with S(T) in (f5"5"]) . as the two have differ- 
ent arguments.) The entropy S(E) can be thought of as 
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the canonical equivalence of the microcanonical entropy 
S(E). However, they are two different quantities for small 
proteins. In the first place, S(E) is a discrete function 
since E is discrete, while S(E) is a continuous function 
of the continuous variable E. In the second place, 

S(E) > S(E), (57) 

the equality holding as M — > oo [2l|. This inequality 
should not be confused with the above inequality (|56|) . 
To demonstrate (f57| , let us assume that E = E is one of 
the energies in the sum in the PF (|57|) . We then rewrite 

S(E) ee S(T) = lnZ(T) + E/T, 

and evaluate exp[S(E)]: 

exp[S(E)} = W(E) + £ W{E)e~^ E ~ E) . (58) 



The above proof does not depend on the discrete na- 
ture of the energies in ME; thus, it is also valid for con- 
tinuum models though more care is needed. We show in 
Fig. [S]the entropies per residue 

s(e) ee (l/M)S{E) 

by symbols, and 

s(e) ee (l/M)S(E) 

by curves, for the three models for the case M — 24 as 
a function of the discrete variable e ee E/M or e from 
our exact enumeration. The energy densities have been 
shifted by the lowest energy density eo = Eq/M for each 
model separately so that all the the curves have the same 
origin. 



The sum above is non- negative; hence, exp[S(E)] > 
W(E), which proves (l57|) above. The difference between 
S(E) = S(T) and S(E) is due to the last term in ([58]) . 
which is expected to vanish as M — > oo. 

In case, E is not one of the energies in the sum, we 
can use a suitable interpolation to define W(E), without 
affecting the conclusion. We give a simple interpolation 
scheme to show this. Let E lie between two allowed en- 
ergies Ei (should not be confused with E\ introduced 
earlier as the highest allowed energy in the model) and 
E 2 > Ei in the microcanonical energy spectrum, and 
introduce 5E = E 2 - E x > 0. Let ~E = E x + xSE, 
E 2 _=E+(1 - x)SE,S(E 1 ) = S(E) - xS'SE,S(E 2 ) = 
S(E) + (1 - x)S'SE, where S' = [S{E 2 ) - S(Ei)] /SE. 
The two terms in exp[S l (_B)] in (|58[) containing E\ and 
E 2 are 

W(E 1 )e-^ E ^ + W{E 2 )e-^ E ^ 



W(E) 



+ e -(l-«)a 



where a = j3 (1 — TS') 8E, as can be easily seen. Assum- 
ing a > 0, we can write 

6"° = 1 + 7, 7 >0. (59) 

It is then obvious that we can express exp[5(£')] as 



B. Concavity of S(E) and Its Absence in S(E) 

1. Concavity of 8(E) and Thermodynamic Stability 

We also see a distinct band structure in s(e) for the 
two perturbed models (Bi and Ci) in Fig. [S] The band 
structure is related to the nature of the perturbative in- 
teractions and has no implication for any phase transition 
as we now discuss. From (1471). we see that 



(60) 



dE\ 



which states that the canonical heat capacity is non- 
negative, and is one of the requirements of stability of 
the system regardless of the size. ^From the relation 
(152p , it is easily seen that the canonical entropy function 
satisfies the conventional thermodynamic relation [2ll | 



8S(E)/dE = l/T. 



(61) 



^From (|6"0|) and above, we conclude that S(E) is, there- 
fore, concave 



d 2 S(E)/dE <0 



exp[S(£)] = W(E) + W(E) 



7 + e 



— (1 — x)a 



+ £ W(E)e-^ E - E ~> 

> w(e), 

which proves ([BT|) for this case also. For a < 0, we use 
e -( 1 - a; ) Q on the left side of (|59|) . and proceed the same 
way with a similar conclusion. The same conclusion also 
remains valid for a = 0. Thus, we have succeeded in 
establishing (T57| in all cases. 



|30j | even for a small system; compare with (J7J for a 
macroscopic system. On the other hand, the microcanon- 
ical entropy need not be concave; see Fig. [SI where the 
bands seen in s(e) have both positive and negative slopes, 
which is in contradiction with (|6ip valid for s(e). The 
non-concave S(E) does not violate the finite system ther- 
modynamics. There is ample evidence that the above 
convexity is also present in the results presented in [27l ]. 
The canonical entropy is the physical entropy for pro- 
teins in its environment and remains concave in Fig. [S] 
as required by thermodynamics. 
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FIG. 6: The canonical equivalence of the entropy s as a func- 
tion of average energy e (continuous curves), and the micro- 
canonical entropy (points) as a function of discreet energy e 
for a given sequence (M = 24) for the three models. We con- 
sider unrestricted conformations here. The energy density has 
been shifted by the lowest energy density eo of each model so 
that the lowest shifted energy density is the same (= 0) for 
all models. We notice a clear band band structure in s in the 
perturbed models (Bi and Ci). The bands become more pro- 
nounced and their separation also decreases, as M increases 
(results not shown). We also see that the native state is al- 
most disjoint from the rest of the bands. This is merely a 
reflection of the energy gap in the standard model A at low 
energies due to finite size of the protein. 



2. Convex Regions in S(E) 

To understand the absence of concavity, we first con- 
sider the standard model A. The energy in this model is 
always negative, so there is no harm in considering the 
entropy as a function of the absolute energy \E\ = TVhh- 
In all cases that we have studied, 5(A?hh) is found to 
be a concave discrete function. The number of states 
W(N RH ) can be partitioned into W{Nsb, N'); see (12TJ1) . 
In the model B, in which the energies are weakly per- 
turbed, e' ~ 0; therefore, most of the conformations in 
W(Nhh) have energies that are close to (— A^hh); some 
of them will have energies that are outside the range 
(— Ahh — 1,— A^hh + 1)- The resulting S(E) associated 
with this A^hh is almost concave, as seen in each of the 
bands in Fig. [HI see the mathematical fits for the two of 
the bands blown up in Figs. and [5] where the mini- 
bands within each of the bands are also evident. This 
then give rise to the lack of concavity or the emergence 
of convexity in the region where two nearby bands over- 
lap. The number of bands equals the number of possible 
values of A^hh in the model A. These convex portions 
of s(e) should disappear and s(e) should approach s(e) 
from below as M — > oo 21]. But for small systems, the 
convex regions persists. The band structure persists for 



all sequences that we have checked. The strongly per- 
turbed energies in the model C provide enough spread 
for each band to strongly overlap, especially at the up- 
per end of the energy spectrum, which reduces the size of 
convex regions. Even here, we find that the band nature 
survives at the upper end of the energies near the maxi- 
mum; the bands at the lower end of the energy spectrum 
continue to persist even for strong perturbation. This is 
clear from Fig. ([6]). Thus, we are confident that convex 
regions in S(E) will exist in any realistic model of small 
proteins. Their presence, however, does not imply any 
phase transition, as S(E) is always concave. This is true 
even though we note from Fig. [6l that there is a clear 
gap between the bands at the lowest energy; see also Fig. 
[22] for a clear evidence of such a gap near the native state 
where we have shown the entropy density for the model 
Bi for low energies. The presence of bands alone and not 
the energy gaps between them give rise to convexity in 
S(E), but not in S(E). One does not need any energy 
gap for a convex S(E) as was the case for the random 
energy model. The energy gaps between the bands in the 
present case are due to the discreteness inherent in small 
systems. As the bands disappear in s(e) in the M — > oo 
limit, there will be no energy gap in this limit, as dis- 
cussed earlier in Sect. IIII Dl 



C. Behavior of S(E) in its bands 

Let us now investigate the behavior of s(e) in these 
bands by finding some smooth fits by neglecting its oscil- 
latory pattern. We consider the two top most bands for 
the weakly perturbed model Bi, which are reproduced in 
Figs. [Jjand[8j respectively, along with the best quadratic 
and cubic fits and their R values. It should be noted that 
the quadratic fit is equivalent to the Gaussian form (|6j). 
provided the coefficient of the quadratic term is negative. 
Because of the nature of each of these bands, this is true. 
If the linear term is positive (negative), then the most 
probable energy Eh within the band is positive (nega- 
tive). From Fig. [7J we observe that the Gaussian fit is 
extremely poor in comparison with the cubit fit; even the 
latter fit is not too good. On the other hand, the result 
for the next band in Fig. [5] shows that both fits are sim- 
ilar in their R-values and that both are poor. This is 
because of the oscillating nature of s(e) in the bands. It 
is interesting to note that the cubic fit is better for the 
top most band than the next lower one. But this cubic 
fit is not a concave function. 



D. Numerical Fits for S{E) and S(E) 

In Fig. [9]we reproduce the ME and CE entropy density 
for the strongly perturbed Ci model (unrestricted con- 
formations); we also show the best quadratic and cubic 
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FIG. 9: Exact ME and CE entropy density for the SP model 
Ci (unrestricted conformations) along with quadratic and cu- 
bic fits. 



the lower end of the energy as M — > oo, as there is not 
an energy gap in our model; see Sect. IIII Dl Hence, to 
conclude an ideal glass transition based on the vanishing 
of the Gaussian fit of the ME entropy is misleading even 
for small proteins. Even the prediction of an energy gap 
is misleading as there are several energy levels between 
the energy E-p and Eq. The presence of a convex region 
in the entropy s (e) in both fits has nothing to do with 
any phase transition as the canonical entropy does not 
show any signature of a transition, as is clear from the 
figure. 

The fits for the CE entropy are given by 

s = 0.8236 + 0.9162e - 2.6895e 2 - 0.279506 3 (R = 0.9843), 
s = 0.5711 + 0.8511e - 0.5434e 2 (R = 0.9994). 

For the CE case, the quadratic fit is the better one; how- 
ever, both fail in the low energy range. Thus, these fits 
also do not do justice to the native state. It should be 
noted, however, that both fits yield a positive CE entropy 
at all energies e > e$. 

In Fig. (fTU)) , we show the entropies and their best 
quadratic and cubic fits for the weakly perturbed model 
Bi. For the ME case, we have 

s = 0.4341 + 1.1203e - 0.8863e 2 - 1.9502e 3 (R = 0.9663), 
s = 0.4406 + 0.9705e - 1.1454e 2 (R = 0.9623). 

Again, both fits are poor at the lower energy range; oth- 
erwise, they are very similar in their R-values. The Gaus- 
sian fit again predicts an energy gap, just as was the case 
for the strongly perturbed model in FigJ^l and has no 
significance for any folding transition. The exact dis- 
crete entropy s(e) does show an energy gap between the 
two lower bands, which is expected to disappear in the 
limit M — > oo. The prediction of negative ME entropy 



FIG. 7: Entropy fit for the last band for the model Bi with 
M = 22 and unrestricted conformations. 
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R 2 = 0.7545 




FIG. 8: Entropy fit for the next to the last band for the model 
Bi with M = 22 and unrestricted conformations. 



fits along their R values. The fits for the ME entropy are 

s = 0.3852 + 0.9875e + 0.1136e 2 - 1.182e 3 (R = 0.9594), 
s = 0.4592 + 0.8717e - 0.8221e 2 (R = 0.9197). 

It is clear that between the two, the cubic fit is the better 
fit overall. However, both fits are extremely poor at the 
low energy end, which is the relevant range for the folded 
or the native state. Thus, the quadratic fit, which as said 
above is the Gaussian form ([BJ, is not suitable to describe 
the ME entropy. Moreover, the quadratic fit gives rise to 
the vanishing of the entropy at an energy higher than the 
lowest allowed energy eo, which is most certainly not true 
of the exact entropy, which is everywhere non-negative 
(e > eo). It is not possible for the entropy to vanish at 
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FIG. 10: Exact ME and CE entropy density for the WP model 
Bi (unrestricted conformations) along with quadratic and cu- 
bic fits. 
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FIG. 11: Densities for M — 16 with only H residues (unre- 
stricted conformations) . The energetics belong to the strongly 
perturnbed case, with only — 1, but all other elements in 
e' are zero. 



from the Gaussian fit is unphysical as above for the same 
reason, and cannot be taken seriously. 
For the CE, the fits are: 

s = 0.7029 + 0.9772e - 1.2906e 2 - 0.8344c 3 {R = 0.9995), 
s = 0.7020+ 1.0448e- 1.3558e 2 {R = 0.9994). 

The behavior of the two fits are similar to that for the 
strongly perturbed case above. Once again, the ME 
entropy fits give non-negative entropy for all energies 
e > e . 

VIII. ENERGETICS EFFECTS ON DENSITIES 
AND SPECIFIC HEAT 

A. Densities and Energetics 

To understand the effect of bending only due to semi- 
flexibility, we consider an unrestricted protein with M = 
16 that belongs to the strongly perturbed case; see 
Fig. 1111 The only non-zero energies are eb = 1, and 
chh = — 1. All other energies in e' are zero. Thus, we are 
considering the model C2. Furthermore, all residues are 
H; there is no P residue. This means that npp = np-yv = 
at all temperatures. There is a unique native state in 
which the protein bends around in a double strand 

RRRRRRRDLLLLLLL 

with A^b = 2, and JVhh = 7 so that it has the energy 
E = —5. Other quantities of interest are: N p — 7, N^ p = 
l,N h = 0,N HW = 20, and N PH = 0. The fact that the 
entire protein is exposed to the water is understandable, 
as there is no interaction with water in this case. This 



is the state of the protein at T = 0. As T is raised, the 
various densities behave as shown in Fig. QTJ It is not 
surprising that tihh mostly decreases monotonically due 
to the penetration of water inside the protein. 

What one notices from the figure is that around T ~ 
0.5, there is not only a sudden rise in the helix density, 
a sudden drop in the parallel bond pair and HH-contact 
densities, but also a minimum in the HW-contact density. 
This minimum is due to the bending penalty as we now 
discuss. As said above, the native state corresponds to 
a double strand (A^hh = 7, and Ahw = 20). This state 
does not have the maximum HH-contact, which happens 
in a compact state (A^hh = 9). However, the compact 
state corresponds to at least 4 additional bends (Ab = 6), 
so its energy is at least E = — 3, and is higher relative to 
the native state. At higher temperatures, the compact 
state, which has higher entropy, becomes more stable. 
This heuristically justifies the dip in tihw- While it is not 
noticeable in the figure, «hh has a maximum (= 0.4539) 
at T = 0.42, exactly where the dip is in nnw (= 1-2172 
at T = 0.42). 

To understand the effects of the energetics better, we 
now give the results for a M — 24 protein with a fixed 
sequence xo- We consider the weakly perturbed model 
Bi. As said above, there are two native states related by 
the symmetry transformation (|30|) . In the native state, 
we have N = (18,12,9,7,10,4,0), and E = -446/50. 
The results for the densities as a function of T are pre- 
sented in Fig. [T3J We observe that the rate of npn rise 
is maximum around T = 0.58; in the neighborhood of 
this temperature, almost all densities in Fig. [T^] have 
some unusual behavior. For example, nun has a rapid 
drop around this temperature. Other densities seem to 
have a plateau around this temperature. As a matter 
of fact, all densities have an inflection point around this 
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FIG. 12: Densities for M = 24 protein (unrestricted confoi 
mations). The energetics belong to the weakly perturnbe 
case. The sequence is prseset to the repeatation of PHHP. 



temperature. 



B. Shifted Energy, Excitations, and Energetics 

An interesting property of the three models should be 
noted from the above Fig.©. The three entropies s(e) 
are drawn in such a way that the upper end of each of 
them corresponds to T = 4.0. What we see is that the 
corresponding shifted energies in the three models satisfy 
the following inequality: 

£c(4.0)-£ c (0) > £b(4.0)-£b(0) >E A (i.0)-E A (0). 

(62) 

Thus, at high temperatures, the excess energy above the 
native state of a given model is highest for the strongly 
perturbed model and lowest for the unperturbed model. 
This should not be taken as to mean that the heat capac- 
ity of the strongly perturbed model is the highest. We 
will return to this issue later. 

We see from (1521 that the excess energy of the strongly 
perturbed model is the highest at T = 4. In Fig. [T2J we 
report the exact excess energies E(T) — E(0) for the three 
models, A, Bi, and Ci (M = 24) on an infinite lattice. 
We see that the behavior changes at low temperatures, 
where the inequality of (|62[) is completely reversed. In 
other words, there are more excitations in the unper- 
turbed model than in the perturbed models. This means 
that the net effect of the perturbations is to make the 
native state more robust to perturbations: The pertur- 
bations stabilize the native state to higher temperatures. 




Temperature 

FIG. 13: Shifted average energies for the three models for 
M — 24 (unrestricted conformations). 



C. Energy Fluctuations or Specific Heat 

We report the energy fluctuations in Fig. [H] for the 
three models (unrestricted M = 24 protein). The fluc- 
tuation is related to the specific heat in the model; see 
(|47ll . The peaks in these fluctuations suggest strong fluc- 
tuations due to cooperativity in the models and are lo- 
cated at the inflection points in the average energies. As 
is known, these peaks usually provide a clue to an im- 
pending thermodynamically sharp transition in the ther- 
modynamic limit. To understand such a claim better, we 
also report in the same figure the energy fluctuation for 
the unrestricted protein (M — 22) in model A. The peak 
of this fluctuation is somewhat lower in height than the 
corresponding peak for M = 24, thus suggesting that the 
peak height has increased with the protein size M. Stan- 
dard statistical mechanical arguments require the energy 
fluctuations in the energy density to decrease with the 
size M for macroscopic systems as follows: 



(Ae) 2 oc 1/M. 

Thus, the fluctuations behave differently for small sys- 
tems. The increase, however, is not very much, suggest- 
ing that the peaks may not diverge as will be the case for 
a continuous folding transition. We expect the folding 
transition to be a discontinuous one in the thermody- 
namic limit. However, more work is needed to settle this 
point. 

The locations of the peak for the weakly perturbed 
model is at higher temperatures than the temperatures 
around T = 0.58, where the densities show unusual be- 
havior. This is most probably due to the finite size ef- 
fects, and should not be surprising. 
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FIG. 14: Energy fluctuations (Ae) 2 for the three models for 
M — 24 (unrestricted conformations). For comparision,we 
also show the fluctuation in the standard model for M = 22. 
We clearly see that the fluctuations become stronger as M 
increases, but the postion of the peak does not shift much. 



IX. CONFORMATIONAL SPACE AND DIS- 
TANCE 

A. Distance Matrix 

1. Conformations or Microstates and Configurational 
Space C 



a doublet r = (x, y) with the location of the root given 
by the doublet (0,0). Because of the choice of the lat- 
tice spacing (a = 1), the coordinates x,y are integers. 
The conformation of the protein is uniquely given by the 
ordered sequence of the doublets R = {r^ } , where the 
residue a(= 1, 2, M) is located at the lattice site r^ Q '. 
We also require the first bond of the protein to be in a 
fixed direction. Each ordered sequence R specifies a pro- 
tein conformation, a microstate, T uniquely. There are 
altogether W distinct conformations or microstates. In 
the following, we will also use state to simply refer to a 
microstate or a conformation. 



2. Distance between Conformations 

The distance between two conformations R and 
R'= |r'( Q '} is defined here to be the Euclidean distance 



d(R, R) 



\ 



M 

E 



[r(«) 



r /(a)12 



The distance provides useful information not only about 
the topology of the energy landscape but may also be rel- 
evant for the dynamical description of the folding process 
(even though we are not presently interested in the dy- 
namics) by introducing the concept of a neighborhood of 
a point in the conformation space C: two conformations 
are neighbors or are connected in C if their separation is 
less than or equal to some chosen distance. 



For monomeric systems, in which each monomer is 
treated as a particle, the energy landscape is easy to 
characterize. One labels the monomers a(= 1,2, ...,M) 
so that each monomer has a unique index a. Then one 
considers their positions r' a ). The ordered set 

R = {r«,r (2) ,r (3) ,-,r (M) } 

specifies a point in the 3M-dimensional configuration 
space C, and the energy E associated with this configu- 
ration then defines the energy landscape in a (3M + 1)- 
dimensional hyperspace. As only the ordered set R is 
used, permutation of particles positions is not allowed. 
Thus, each point in the configuration space represents a 
distinct microstate of the system. The ordered nature 
of the set R also takes into account for the connectiv- 
ity of the protein: a residue occupying the lattice site 
r( fc ) is connected to its neighboring residues located at 
lattice sites r( fc_1 ) (for k > 1) and r (fe+1) (for k < M). 
To see this most easily, we proceed as follows. We take 
the C-terminus of the protein to be the starting point of 
the sequence. We index the starting point as the first 
residue, which is used to root the protein. Each succes- 
sive residue in the sequence is, hereafter, given an index 
increasing by one, until the last residue is given the index 
M. The location of a site on the lattice is also given by 



3. Distance or Neighborhood Matrix 

The distance rf(R, R') can be used as an element to de- 
fine a W x W distance or neighborhood matrix T>, whose 
diagonal elements are the only elements that are 0. All 
other elements are non-zero. Thus, T> is not going to be a 
sparse matrix. The distance between a compact confor- 
mation and a completely extended conformation will be 
among the largest. The shortest distances will usually be 
between two conformations that differ in a few elements. 
For example, assume that only the elements r^ M ' and 
r '( M ) differ. The two elements can only differ in one of 
its components, and that too by only one lattice spacing. 
Thus, the distance between these two conformations will 
be 1. It is also possible that two conformations differ in 
only one interior element at the position k ^ M. The vec- 
tors r( fe ) and r'^ must differ in each of their components 
by 1. Hence, the distance between these conformations 
will be s/2. 

4- Native (0) and Stretched (S) States 

As W is usually a large number, it is not possible to 
study the entire matrix T>. Therefore, we will consider 
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the distance of each conformation from two selective con- 
formations, viz. the native state (to be denoted by in 
the following) and the completely stretched state (to be 
denoted by S in the following); the latter is the confor- 
mation in which the protein is given by the string of only 
R steps 

RRR 

so that the conformation is completely in the horizontal 
direction. If the native state is not unique, we pick the 
first one of the generated native states. The stretched 
conformation is unique in that it does not depend on the 
energetics. On the other hand, the native conformation 
depends strongly on the energetics and, therefore, is not 
unique as far as different energetics are concerned. This 
feature makes the stretched conformation a desirable ref- 
erence state. This state can be used to compare proteins 
with different energetics. We denote the two distances by 
do(R) (from the native conformation) and cfe(R) (horn 
the stretched conformation), respectively. In most cases 
of interest, there is a unique native state for a given en- 
ergetics. It is the standard model which invariably gives 
rise to degenerate native states. 

Let the set R/= |r| a -'| denote the two reference con- 
formations (I = 0,S), and 



<*(R) = 



\ 



M 



a=l 



the distance of some conformation T specified by the 
set R from R;. This distance of a conformation of en- 
ergy -E(R) gives information about how close that con- 
formation is to the native state. Thus, we can clas- 
sify each conformation by its distance di and energy E 
and present them in a two-dimensional plot as in Figs. 
U5ll6ll7llg|2UITgi and [H In all these plots, we have 
shifted the energy so that the native state energy is at 
0, so that we can compare the configuration space C of 
proteins with different energetics. Moreover, we only con- 
sider one of the native states if there are several native 
states to save computational time. In this sense, our re- 
sults are not complete in such cases. Therefore, we also 
present the result for a weakly interacting M = 16 pro- 
tein of a sequence for which there exists only one unique 
native state so that we can compare this complete case 
with the incomplete case. We will find that there is no 
dramatic difference. 



5. Reduction of C to a 2- dimensional plane C21 

The use of the two reference states will provide us with 
two distinct but partial perspectives of the configuration 
space C by projecting it on a lower dimensional space. 
Let us consider the perspective of C while looking at it 
from the native state. The projected plane is denoted 



by C20 .Imagine the energy distribution of conformations 
that are a distance do from the native state. All these 
states are on a hypersphere of radius do and have various 
energies. Let us further coalesce all of the conformations 
of a given energy E that lie on this hypersphere to a sin- 
gle point. We will use W(do,e) to represent the number 
of these conformations associated with the single point 
in C2o- Such a transformation allows us to transform 
C to a two-dimensional surface C20 on which a point is 
represented by (do,e). On such a plane, a constant en- 
ergy line represents the equipotential conformations at 
various distances from the native state. All these confor- 
mations are at the same height (from the native state) in 
the energy landscape. A constant do line represents all 
conformation with various energies that lie on a hyper- 
sphere centered at the native state. A similar reduction 
from C to the two-dimensional surface (ds,e) provides 
another perspective of the energy landscape. We will use 
W(dg, e) to represent the number of conformations asso- 
ciated with the single point in the above coalescing on 
the ((is j e) plane. The projected plane is denoted by C2S- 
It is obvious that 



W(e) = J2W(di,e), l = 0,S, 



(63) 



so that the two perspectives only differ in the way W(e) 
is partitioned into W(di, e) by the distance di. The total 
number of microstates W(e) remains the same in the two 
perspectives. In addition, the allowed energies also do 
not change in the two representations of C. 



B. Standard Model 

The first two figures, Figs. [15] and Fig. [T|5] are for 
the standard model. Fig. [15] shows the energy density 
distribution vs. do (red circles: C20) or ds (blue triangles: 
C2s) ; respectively; they are two possible perspectives of 
C. The two conformations at d = in Fig. 1X51 represents 
the native conformation (red circle at e = 0) and the 
extended state (blue triangle at e = 3/8) that are used 
as the origin of the distance for the two perspectives, 
respectively. We observe that both the maximum and the 
minimum do increase with e, the former more so than the 
latter. However, while the maximum ds increases with 
e, the minimum dg decrease with e. We observe that the 
maximum do, to be denoted by do, max, is about 120, while 
the maximum ds, to be denoted by G?s,max, is about 200. 
As said above, the number of conformations W(e) for a 
given energy, and the allowed energies (7 in total) are 
the same for both distributions. The left axis shows e 
for the red circles and the left axis for the blue triangles. 
The left axis has been shifted by 0.02 so that the two 
colors do not overlap. In Fig. [111 we show the 3-d plot 
d — e — W(d, e) as the projected energy landscape built on 
C20 (red circles) and C2S (blue triangles). The energies 
for blue triangles has been shifted by 0.02 so that the two 
symbols will not overlap. We observe that for a given e, 
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W(d,e) has a single peak in C20 (red circles), while it 
has several peaks in C2S (blue triangles). Moreover, the 
peaks in C20 rise and move away from the native state 
as we approach higher energies. 

Because of the sum rule (|53"]) and the fact that the al- 
lowed d-range of ds is much larger than of do, it is not 
surprising that W(do,e) is much higher near its peak 
than W(ds,e) near its peaks. It is clear that many high 
energy conformations are far from the extended confor- 
mation of the same high energy, but most of these high 
energy conformations are closer in distance from the na- 
tive conformation. This suggests a very open landscape 
for the standard model with the native state in the mid- 
dle, and which continues to narrow down with decreasing 
energy. We also note that C2S is more symmetric than 
€20- We also observe that the native state is around 
ds — 90 from the stretched state (e = 3/8); see blue tri- 
angles. It follows from the figures that there are several 
other states of energy e = 7/16 that are much closer to 
the native state. Indeed, there are high energy states as 
close as about do = 10. 

We comment on some interesting features that is ap- 
parent in the figures. Consider Fig. 1151 The best way 
to understand this figure is to imagine drawing a (hy- 
per)circle of radius d with its center at the chosen native 
state. Now draw a (hyper)cylinder on this circle along the 
energy direction. Then the microstates that lie on this 
cylinder are the microstates (red circles) that appear on 
the vertical line drawn at the distance d (from the na- 
tive state) in Fig. [15] All of these microstates are on the 
cylinder of radius d, but the distances between them may 
be much different from d. In fact, some of them may be 
closer than d, while others may be farther apart. 

The conformation closest to the native state in Fig. [T5] 
is not at e = 1/16, but another native state. Thus, there 



FIG. 1 



FIG. 15: E vs. ds distribution for for M — 16 Model A (unresl 
protein (unrestricted conformations). 




E vs. do distribution for M — 16 Model A protein 
icted conformations). 



is no energy barrier between these two microstates (at 
the same energy). However, there are other microstates 
at the lowest energy that are widely separated in the ra- 
dial direction of do - The same is true of states at e = 1/16. 
(At higher energies, the microstates are almost dense in 
do, so that can be treated as connected in that they lie 
on neighboring cylinders.) Consider the microstates at 
e = 1/16. Between various separations (in the direction 
of do) in these states exist many higher energy states at 
e = 2/16. This is true in other figures also. Thus, this 
feature appears to be generic. But this is true only of 
the lowest lying microstates. The microstates at higher 
energies are connected in the sense note above. Thus, 
the energy barriers in the radial direction exist only for 
low-lying states. There are no barriers in the radial di- 
rection for highly excited states. This does not imply 
that there are no barriers in other transverse directions 
in the configurations space C. The implication of this 
for the possible dynamics can be easily appreciated if we 
recognize that only local moves are possible in a suitably 
chosen short duration r. During this time r, the protein 
can only change its conformation to a new conformation 
that is nearby in distance. Thus, in the process of fold- 
ing, the protein will more efficiently move to the native 
state from e = 2/16 than from e = 1/16, if the former 
is closer to the native state than the latter. We will not 
pursue this point further here as we are only considering 
equilibrium properties in this work. We hope to return 
to this issue in a future contribution. 



C. Weakly Perturbed Model 

The energy density e in the standard model changes 
by a non-zero but appreciable amount Ae = 1/16. This 
can be made smaller by introducing other energies in the 
model. For the model B2, the results are shown in Figs. 
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FIG. 18: E vs. d s distribution for M = 16 Model B 2 protein 
(unrestricted conformations) for the sequence \o- 



FIG. 17: E vs. d distribution for for M — 16 Model B 2 
protein (unrestricted conformations). 



(fT71 and fl8|) for the sequence xo- In Fig- [13 we show 
C20 along with the distribution W(do,e) for some se- 
lected distances d = 20, 30, 40, 50, and 60. In Fig. [H we 
show C2S- The discrete band structure of Figs. (fl~5] and 
IT6"| still persists even to the higher energies, except that 
Ae is smaller, and the energy spectrum begins to look 
more continuous at higher energies. At energies close 
to the native state, the spectrum is still very much dis- 
crete. Otherwise, the features of the model A have not 
disappeared. For example, the symmetry in C2S is still 
present; see Figs. (fTol and [IS]). In Fig. Q1JJ we show 
the result for a weakly interacting Model B2 M = 16 
unrestricted protein for the following sequence: 

X : PPPPHHPPHHHHHHHPP. 



FIG. 19: E vs. ds distribution for M = 16 Model B 2 protein 
(unrestricted conformations) for the sequence \ '■ PPPPHH- 
PPHHHHHHHPP. 



In this case, there is only one unique native state, which 
is given by the string 



RRRDDDLUULDLULD 



D. Strongly Perturbed Model 



starting with the first residue. The energy of the na- 
tive state has N = (9,6,4,2,4,4,4), and E = -117/56. 
However, a comparison with Fig. [TBI shows that the dis- 
tributions of states in C2S for the two cases are almost 
the same, except at low energies. Thus, we believe that 
our incomplete results are not different from the complete 
results at intermediate and higher energies. 

The distribution W(d n , e) in Fig. [T71 shows that it has 
an oscillatory pattern and that the highest peak in it has 
a maximum around do =60, and e = 0.7. 



The projected conformation spaces C20 and C2S for 
model Ci are shown in Figs. [20] and [21] respectively. 
We again see the symmetry present in the distribution 
of states in C2S- The energetics is such that there is a 
strong mixing of levels to the point that the clear cut 
band pattern is completely absent at high energies; their 
discrete nature is still present near the bottom. The en- 
ergetics change the native state so that its distance from 
the extended state are different in the three models. 
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the hypercylinder cuts the landscape is given by the sum 



FIG. 20: E vs. do distribution for M 
(unrestricted conformations) . 



16 Model Ci protein 



FIG. 21: E vs. ds distribution for M 
(unrestricted conformations). 



16 Model Ci protein 



E. Small System Energy Landscape and Convex- 
ity of S{E) 

The distribution of points in the C2 plane allows us 
to draw certain conclusions about the form of the energy 
landscape. Assume that the energy landscape is a single 
inverted cone of a fixed (hyper-solid)angle. In that case, 
all conformations of a given energy E will have the same 
radial distance from the native state in C, and the energy- 
distance distribution in C20 will be represented by points 
that lie along a single straight line at a fixed angle in the 
e — do plane: For each energy, all states are collapsed into 
a single point on this line. This is most obviously not the 
case here in any of the C20 for the three models shown 
here. Consider the standard model in Fig. (TTS]). We see 
that there are a few allowed energies at a given distance 
do from the native state. If we draw a hypercylinder of 
radius do, then this cylinder will cut the landscape at 
these energies. These energies are at different angles so 
they lie on different cones making different angles at its 
apex located at the native state. The number of points 



W(d Q ) = Y,W(d ,E), 



where W(do, E) is the number of conformations of energy 
E that are at the radial distance do from the native state. 

Because of conformational changes during folding, the 
folding is believed to be governed by the multiplicity 
W(E), which in turn governs the energy landscape [2l| : 
each point on the hypersurface represents a conforma- 
tion. The lack of concavity discovered here has a pro- 
found effect on the shape of the landscape. It no longer 
narrows down as E decreases. It will be interesting to 
pursue this point further. This is beyond the scope of 
the present work, but we hope to consider it elsewhere. 
It is evident, and as discussed above, several different N 
will usually mix together for a given E, except in the 
model (A) [in which E — — Ahh]- There will be a cer- 
tain landscape topology for the standard model, which 
will change with e'. From (|20p. it is evident that the 
landscape will become narrower for e' 7^ 0. At the same 
time, the total "surface area" W of the landscape will not 
change (even though the allowed energies change) with 
e'. It is possible that it is this narrowing at constant W 
that makes the approach to native state more directional 
with the consequence that it would be fast. This issue 
needs to be probed carefully. 

Since it is CE that is relevant for a real protein in its 
environment, it is the canonical multiplicity 

W C e(E) =exp[S(E)\ 

that is relevant for folding. As shown above in (I61|) . it 
continuously increases with E, until we reach at infinite 
temperatures. Thus, the narrowing of the landscape with 
non-zero e' may not be as relevant for protein folding as 
the observation that W C e(E) > W(E). From (J58|), we 
observe that Wce(E) gets contribution from all confor- 
mations, not just the conformations W associated with 
E. In particular, it also includes the contribution from 
the native state(s) though its probability is going to be 
small unless we are at very low temperatures. Thus, it 
is misleading to think that a small protein at a given T 
only probes average conformations W(E) when in equi- 
librium. As T is reduced, the protein continues to probe 
all conformations although the probability for conforma- 
tions of lower energies increases. It would be interesting 
to pursue the consequence(s) of this observation. 



X. FREE ENERGY LANDSCAPE AND 8S(E)/dE 
A. Free Energy Landscape 

Let us consider the implications of the non-concavity 
of S(E) on the free energy functional 



F(E,T) = E -TS(E), 
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FIG. 22: The free energy functional f(e,T) = F(E,T)/M 
at three different temperatures T = 0.9, 1.0, and 1.1 for the 

model Bi. We also show the entropy s(e) for low energies. FIG. 23: A quadratic and cubic fit for the ME entropy for 
The free energy functional describe the free energy landscape M = 24 Model A. 
at a given temperature T. The functions are discrete and the 
curves are drawn through their points only as a guide for the 
eye and to clearly show the undulations in them. 

B. Lack of Physical Significance of Global Mini- 
mum of F(E, T) 



which should not be confused with F(T) introduced ear- 
lier in ([50")) and (|5Tj) . The later represents the free energy 
of the equilibrium state of the system. It is a monotonic 
function of T, and because E is monotonic in T, it is 
also a monotonic function of E. On the other hand, the 
functional F(E, T) is defined at any T as a function of 
E. Thus, it is also defined for energies different from the 
equilibrium energy at T . For a macroscopic system, it 
is well known that one must minimize globally F(E,T) 
with respect to E at fixed T to obtain the equilibrium 
free energy F(T) = F(E, T) evaluated at the minimum. 
For continuous functions, this minimization is equivalent 
to |(61]) for S(E) : 

dS(E)/dE = 1/T. (64) 

It is this relation that was used for the Gaussian 
entropy ((6]) to obtain the Gaussian energy relation ([8|) 
earlier. The above discussion makes it clear that the 
derivation given there was valid for a macroscopic system, 
and not for a small system. 

At a given temperature, the free energy functional 
F(E, T) describes, what is customarily called the free 
energy landscape at that temperature with the energy E 
playing the role of a reaction coordinate of the landscape. 
We show in Fig. [221 this landscape at three different tem- 
peratures T = 0.9, 1.0, and 1.1 for the weakly perturbed 
model B x (unrestricted protein with M = 24). We have 
also shown the entropy density at low energies, which is 
a blow up of the entropy shown in Fig. [SJ 



The global minima of the three landscapes occur at 
e = -0.0783, -0.3717, and 0.0742, respectively. (e Q = 
—0.3717.) The depths of the minima are, respectively, 
/ = -0.4410, 0, -0.4006, and -0.5309. That the energy 
of the global minima and their depths as a function of 
temperature have no thermodynamic significance is ob- 
vious when we recognize that these energies and free en- 
ergies are not monotonic in T, whereas proper thermo- 
dynamics requires them to be monotonic even for small 
systems. It is interesting to compare these energies and 
the depth of the free energy minima with the exactly 
computed average energy density e(T) and the free en- 
ergy f(T). The computed average energies at these tem- 
peratures are -0.0479,0.0054, and 0.0493, while the free 
energy densities are —0.6294, —0.697, and —0.7694. 

What is striking is the tremendous error in the com- 
puted values and those obtained by the application of 
macroscopic thermodynamic principle to small proteins. 
Neither the location nor their depth are close to the exact 
computed values. This is a sobering realization of the ef- 
fects of the finite size of the protein on thermodynamics. 



C. Error in Using 8S(E)/dE = 1/T 

We consider the unperturbed model A for an unre- 
stricted protein (M = 24), whose ME entropy density as 
a function of the energy density is reproduced again in 
Fig. [221 As it is a discrete function, we cannot calcu- 
late its derivative to see if (f64|) is valid for small systems. 
However, it is possible to find a continuous fit for s(e). 
We have shown a quadratic and a cubic fit in Fig. [231 
along with the i?-values. They are respectively 
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FIG. 24: Calculation of T using (J64J for the cubic fit of the 
ME entropy in Fig. ((23} and by using the CE entropy. 



s = 0.8737 + 0.5484e - 3.0151e 2 ; R = 0.9987, 

s = 0.8650 + 0.2145e - 5.1161e 2 - 3.361e 3 ; R = 0.9990, 

and can be used to calculate the inverse derivative 
dE/dS, which is plotted in Fig. [M] as the blue curve, 
along with the inverse derivative dE/dS as the red curve. 
Here, we have used the cubic fit for the calculation of 
dS/dE. The difference shows the error caused by using 
([5^)1 to calculate the inverse temperature. The correct 
temperature is given by the red curve. We find that the 
correct temperature from CE is lower than the incorrect 
temperature from ME at lower energies, with their na- 
ture reversed at higher temperatures. In other cases, it 
is also possible to observe the opposite relation for the 
two ways of computing the temperature. 

XI. DISCUSSION AND CONCLUSIONS 

We have considered a lattice model of a small pro- 
tein as a semiflexible copolymer in its solvent environ- 
ment. The copolymer is random due to possible forms 
of its residue sequence, but this randomness is consid- 
ered frozen (quenched). The model presented here is an 
extension of the original model of semiflexible homopoly- 
mer due to Flory; this extended model has been investi- 
gated recently by us. However, the model requires a very 
important modification because of the heteropolymer na- 
ture of the protein. Here, we have restricted our investi- 
gation to an incompressible copolymer representation of 
the protein. Another novelty is to restrict the analysis to 
a single protein size of a finite size M. Our previous in- 
vestigation has involved either an infinitely long polymer 
or an infinite number of finite polymers. Thus, studying 



small system effects on the statistical mechanics of the 
protein has been a central feature of this investigation. 

Our aim is to study exactly the statistical mechanics 
of the general model. For this, we take the approach of 
exact enumeration in which we count exactly the num- 
ber of conformations of the protein by anchoring one of 
its ends, the C-terminus, at the origin of the lattice. We 
consider a square lattice and use its lattice symmetry to 
generate only those conformations whose first step from 
the origin is in the horizontal direction. This reduces 
the number of conformations by 4. We also allow the 
first bend only in the downward direction, but not in 
the upward direction to further reduce the number of 
conformations that we generate. We consider two differ- 
ent kinds of conformations for enumeration. We cither 
consider only compact conformations or consider unre- 
stricted (compact and non-compact) conformations, and 
generate all conformations under the above two restric- 
tions due to lattice symmetry. For compact conforma- 
tions, we have considered M < 64, and for unrestricted 
conformations, we have considered M < 26 so that the 
enumeration can be done in a reasonable amount of time. 
As real protein interactions are not well-understood, we 
have considered three different model energetics to study 
the effects of energetics on protein thermodynamics. One 
of the models (Model A) is the standard model, while the 
other two are obtained by weak perturbation (Model B), 
and strong perturbation (Model C). 

Using plausible arguments under some very mild as- 
sumptions, we show that these models have no energy 
gap for M — > oo, even though there appears to be some 
gap in the case of small proteins. Indeed, an energy gap 
is not the only way a discontinuous folding transition can 
occur. The latter is known to occur even in the absence 
of an energy gap such as the Flory model of semiflexi- 
ble homopolymer as shown recently by us. However, the 
presence of a gap endows the microcanonical ensemble 
(Boltzmann) entropy S(E) with non-concavity. For a 
macroscopic system, such a non-concave entropy implies 
a discontinuous folding transition. Thus, it is the non- 
concavity that drives the discontinuous folding transition 
and not the energy gap. However, we demonstrate that 
it is the canonical ensemble equivalent entropy function 
S(E) that shares the concavity requirement for small or 
macroscopic systems; the canonical entropy S(E) is not 
required by thermodynamics to be concave. Moreover, 
we prove that S(E) > S(E).Oui exact enumeration con- 
firm these facts. We show that a Gaussian fit is not very 
good for exact entropies that we calculate, especially at 
low energies, the energies most relevant for the folding 
transition. The Gaussian fit invariably gives rise to neg- 
ative entropies that are then avoided by advocating an 
energy gap. This is despite the fact that the exact enu- 
meration never leads to a negative entropy. Thus, the 
usefulness of the random energy model for small proteins 
is highly questionable. 

It is plausible that infinite random copolymers are self- 
averaging. As a consequence, all thermodynamic densi- 
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ties are the same for almost all sequences. However, we 
find that small proteins are far from being self-averaging. 
Therefore, as is commonly believed, the protein sequence 
is extremely relevant for its proper or desired function- 
ing. In other words, we cannot overlook the importance 
protein sequences have in determining the native state. 
Also, as expected, various densities such as the entropy 
and energy densities retain a strong dependence on M; 
this dependence should not be neglected. While a small 
protein is not supposed to show a sharp folding transi- 
tion, a signature of a rounded folding transition appears 
in the peak in the specific heat. 

We introduce a notion of a distance between conforma- 
tions and show how the multi-dimensional configuration 
space C can be mapped onto a two-dimensional config- 
uration space C21J =0,S. These two-dimensional projec- 
tions provide a glimpse of the form of C, and from which 
we obtain some limited perspective of the energy land- 
scape. We also calculate the free energy landscape by 
using the energy density as the coordinate. These free 
energy landscape appear very flat with undulations that 



are not very high. 

We have also shown that applying thermodynamic 
relations that are valid for macroscopic systems to 
small system microcanonical entropy will cause errors 
in estimating thermodynamic properties, and should be 
avoided. 
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